PACIFIC SYMPOSIUM ON BIOCOMPUTING 2024


2024 marks the 29th Pacific Symposium on Biocomputing (PSB). We gather once again on the 
Big Island to share the latest progress and challenges in biocomputing. 2023 was a year of
re-emergence for Artificial Intelligence (AI). Large language models (LLMs) and particularly
ChatGPT brought AI into the public consciousness in a way not previously seen. At the same 
time, advances in AI have fueled great progress in the analysis of structural and functional 
molecular data, omics data sets, electronic health records, biobanks, and many other areas of 
biocomputation. LLMs themselves have clear applications in both clinical medicine and in basic 
research, and we are experiencing an explosion of creative uses of these powerful (but still 
imperfect) tools. In addition to LLMs for human language, there are powerful LLMs built on 
protein and DNA sequence which show remarkable utility in representing these molecules and 
detecting signals and correlations between sequence and structure/function. These areas provide 
a rich background for PSB 2024. Of course, not all biocomputing is AI and there is still need for 
important efforts in traditional algorithms, informatics, data science, statistics and (importantly) 
in the understanding of the social setting in which our tools are used. We are more aware than 
ever that the choice of problems to address, the representativeness of the data that we use, and the 
ways we evaluate the success of our computational artifacts should all be considered with 
intention and sensitivity to considerations of justice, autonomy, beneficence, and
non-maleficence.


In addition to being published by World Scientific, the proceedings from all PSB meetings
are available online at http://psb.stanford.edu/psb-online/. Since 1996, all PSB papers
have been indexed in PubMed. These papers are routinely cited in archival journal articles
and often represent important early contributions in new subfields, many times before
there is an established literature in more traditional journals; for this reason,
many papers have garnered hundreds of citations.


The social media handle for PSB is @PacSymBiocomp and the hashtag for PSB 2024 is 
#PSB24. 


The efforts of a dedicated group of session organizers have produced an outstanding program. 
The sessions of PSB 2024 and their hard-working organizers are as follows: 


Artificial Intelligence in Clinical Medicine: Generative and Interactive Systems at the 
Human-Machine Interface 

Organizers: Sajjad Fouladvand, Emma Pierson, Ivana Jankovic, David Ouyang, Jonathan H. 
Chen, Roxana Daneshjou 


Digital Health Technology Data in Biocomputing: Research Efforts and Considerations
for Expanding Access
Organizers: Jessilyn Dunn, Michelle Holko, Chris Lunt


Drug-Repurposing and Discovery in the Era of “Big” Real-World Data: How the
Incorporation of Observational Data, Genetics, and Other -omic Technologies Can Move 
Us Forward 

Organizers: Megan M. Shuey, Jacklyn N. Hellwege, Nikhil Khankari, Marijana Vujkovic, Todd 
L. Edwards 


Overcoming Health Disparities in Precision Medicine 
Organizers: Francisco M. De La Vega, Kathleen C. Barnes, Keolu Fox, Alexander Ioannidis, 
Eimear Kenny, Rasika A. Mathias, Bogdan Pasaniuc 


Precision Medicine: Innovative Methods for Advanced Understanding of Molecular 
Underpinnings of Disease 
Organizers: Yana Bromberg, Hannah Carter, Steven E. Brenner 


We are also pleased to present five workshops in which investigators with a common interest 
come together to exchange results and new ideas in a format that is more informal than the 
peer-reviewed sessions. For this year, the workshops and their organizers are: 


Large Language Models (LLMs) and ChatGPT for Biomedicine 
Organizers: Zhiyong Lu, Steven E. Brenner, Cecilia Arighi 


Practical Approaches to Enhancing Fairness, Social Responsibility and the Inclusion of 
Diverse Viewpoints in Biomedicine 
Organizers: Daphne O. Martschenko, Nicole Martinez-Martin, Meghan Halley 


Risk prediction: Methods, Challenges, and Opportunities 
Organizers: Rui Duan, Lifang He, Ruowang Li, Jason H. Moore 


Statistical Analysis of Single-Cell Protein Data 
Organizer: Brooke Fridley 


Tools for Assembling the Cell: Towards the Era of Cell Structural Bioinformatics 
Organizers: Emma Lundberg, Trey Ideker, Andrej Sali 


The PSB 2024 keynote speakers are Scott Penberthy (Science keynote) and Andrea Roth 
(Ethical, Legal and Social Implications keynote). 


Tiffany Murray has managed the peer review process and assembly of the proceedings since
2001 and plays a key role in many aspects of the meeting. We are grateful for the support of
the National Institutes of Health,¹ ISCB, and Cleveland Institute for Computational Biology.
The Research Parasite Awards benefit from support from GigaScience, Jeff Stibel, Mr. and Mrs.
Stephen Canon, and Drs. Casey and Anna Greene. The Research Symbiont Awards benefit from
support from the Wellcome Trust and the DragonMaster Foundation.

We are particularly grateful to the PSB staff Al Conde, Paul Murray, Ryan Whaley, Mark
Woon, BJ Morrison McKay, Cynthia Paulazzo, Jackson Miller, Heather Miller, and Nicholas 
Murray for their assistance. We also acknowledge the many busy researchers who reviewed 
the submitted manuscripts on a very tight schedule. The partial list following this preface 
does not include many who wished to remain anonymous, and of course we apologize to any 
who may have been left out by mistake. 


We look forward to a great meeting and to seeing you on the Big Island. Aloha! 


Pacific Symposium on Biocomputing Co-Chairs, 
October 9, 2023 


Russ B. Altman 
Departments of Bioengineering, Genetics, Medicine & Biomedical Data Science, Stanford 
University 


Lawrence Hunter 
Department of Pharmacology, University of Colorado Health Sciences Center 


Marylyn D. Ritchie 
Department of Genetics and Institute for Biomedical Informatics, University of Pennsylvania 


Teri E. Klein 
Departments of Biomedical Data Science & Medicine, Stanford University


Finally, we wish to thank the scores of reviewers. PSB aims for every paper in this volume
to be reviewed by three independent referees. Since there is a large volume of submitted

of you listed below and to anyone whose name we may have accidentally omitted or who

¹Funding for this conference was made possible (in part) by R13LM006766 from the National
Library of Medicine. The views expressed in written conference materials or publications and by


Artificial Intelligence (AI) models are substantially enhancing the capability to analyze complex
and multi-dimensional datasets. Generative AI and deep learning models have demonstrated 
significant advancements in extracting knowledge from unstructured text, imaging as well as 
structured and tabular data. This recent breakthrough in AI has inspired research in medicine, 
leading to the development of numerous tools for creating clinical decision support systems, 
monitoring tools, image interpretation, and triaging capabilities. Nevertheless, comprehensive 
research is imperative to evaluate the potential impact and implications of AI systems in healthcare. 


© 2023 The Authors. Open Access chapter published by World Scientific Publishing Company

At the 2024 Pacific Symposium on Biocomputing (PSB) session entitled “Artificial Intelligence in
Clinical Medicine: Generative and Interactive Systems at the Human-Machine Interface”, we 
spotlight research that develops and applies AI algorithms to solve real-world problems in 
healthcare. 


Keywords: Artificial Intelligence, clinical medicine, decision support systems. 


1. Introduction 

Recent progress in AI has led to the development of advanced large language models (LLMs), 
image, genomic and tabular data analysis tools (Huang et al., 2023; Movva et al., 2023, 2023; 
Omiye et al., 2023; OpenAI, 2023; Singhal et al., 2023; Tate et al., 2023; Wehbe et al., 2023). 
Leveraging these AI models for real-world biomedical data analysis is critical for enhancing 
diagnostic accuracy, predicting patient outcomes, and personalizing treatment plans, ultimately 
contributing to improved patient care and health outcomes. However, systematic evaluation of the 
potential and limitations of AI algorithms within the medical domain is crucial to ensure the
efficacy, safety, and reliability of AI-driven healthcare solutions and interventions (Wornow et al.,
2023). 

Here, we highlight the accepted submissions for the Artificial Intelligence in Clinical Medicine:
Generative and Interactive Systems at the Human-Machine Interface session at the Pacific
Symposium on Biocomputing (PSB) 2024. A goal of this session is to showcase research that has
identified a clinical need that can be addressed by AI methods. Accepted submissions include use
cases of generative and classical AI models applied to different clinical data modalities
and to a variety of applications such as answering medical questions, medical image analysis,
clinical note analysis, cognitive monitoring, digital twins, and other decision support systems.


2. Artificial Intelligence in Clinical Medicine 
2.1. Medical text and clinical notes analysis 


There have recently been numerous successful applications of LLMs in ingesting medical text and
clinical notes to extract vital information and insights for enhanced patient care. Lozano et al.
(2024) proposed Clinfo.ai, an open-source retrieval-augmented LLM system for answering
medical questions using scientific literature. The authors evaluated Clinfo.ai alongside other
question-answering systems, which the proposed method improves on, using a benchmark that the
authors made publicly available. Systems like this highlight the potential
of large language models to help clinicians stay abreast of the enormous (and growing) medical
literature. Jiang et al. (2024) proposed VetLLM, a large language model for predicting diagnoses
from veterinary notes. They evaluated whether LLMs can be used to extract diagnoses from
unstructured veterinary notes. This approach can more easily facilitate broad veterinary research
given that previous work often relies on customized, specialized models for each diagnosis. The
paper revealed that, even without fine-tuning, open-source LLMs like Alpaca-7B show promising
performance in diagnosis extraction tasks; performance is further improved when the model is
fine-tuned on datasets of veterinary notes.

Pakhomov et al. (2024) proposed a conversational agent for early detection of neurotoxic effects
of medications through automated intensive observation. This paper presents an AI system for
monitoring cognitive symptoms of neurotoxicity, which can occur in response to some
immunotherapies. The system, a conversational agent, conducts a cognitive assessment over the
phone that includes both spontaneous speech and neurocognitive tests. The authors present the results
of a pilot study. Such systems have the potential to allow for intensive monitoring of patients while
reducing the burden on them and on medical staff (since automated monitoring can be conducted
while the patient remains at home).


2.2. Medical image analysis 
Another compelling avenue where AI has shown promising results is medical image
processing. This domain has witnessed a remarkable transformation, with AI algorithms now
capable of efficiently analyzing a wide range of medical images, including ultrasound, X-rays,
MRI scans, and CT scans, to yield faster and more accurate diagnoses. Duffy et al. (2024) used
convolutional neural networks (CNNs) to evaluate the performance of AI models on 2D and 3D
cardiac ultrasound datasets. Cardiac ultrasound is generally recorded as 2D video, but newer
transducers allow the collection of 3D data that can be post-processed into standard 2D view videos. Using
previously published CNNs for echocardiography (Ouyang et al., 2020), Duffy et al. showed that
biases in 2D data (foreshortening and off-axis views) can be simulated from the 3D data and have
important impacts on model output.

Li et al. (2024) proposed BrainSTEAM, a practical pipeline for connectome-based fMRI analysis
towards subject classification. This work addressed the overfitting problem in Graph Neural
Networks (GNNs) used for analyzing structured network data. BrainSTEAM uses a spatio-temporal
module that includes an EdgeConv GNN model, an autoencoder, and a strategy to
dynamically segment time series signals, construct correlation networks, capture region-of-interest
(ROI) connectivity structures, denoise data, and enhance model training. BrainSTEAM was
evaluated on two real-world neuroimaging datasets, ABIDE for autism prediction and HCP for
gender prediction, showing superior performance compared to existing models. This framework
is potentially applicable to other studies for connectome-based fMRI analysis, promising enhanced
reliability for clinical applications. Finally, recognizing the variation in human-quantified
phenotypes, Vukadinovic et al. (2024) show that different ways of assessing left ventricular
ejection fraction, including variation within the range of clinician-to-clinician variability, can
significantly affect downstream analyses, including genome-wide association studies,
where less precise measurements substantially weaken the signal for genetic loci. In terms of
statistical power, 1% less precision in measurements was equivalent to a 10% decrease in
cohort sample size.


2.3. Neurobiology and cognitive function 
Prantzalos et al. (2024) presented MaTiLDA, an integrated machine learning and
topological data analysis platform for brain network dynamics. Brain activity is recorded via
electroencephalograms (EEGs); however, analyzing large volumes of recordings can be difficult.
They introduced and publicly shared MaTiLDA to enable the use of machine learning with
topological data analysis on EEG data. They then showed how their platform could be used to
analyze EEG data from neurological disorders such as epilepsy.

Yang et al. (2024) showed that deep neural networks (DNNs) applied to brain MRI images can be used to
distinguish between normal subjects and subjects with cognitive impairments such as Alzheimer’s disease.
Javedani Sadaei et al. (2024) proposed Zoish, a novel feature selection approach leveraging
Shapley additive values for machine learning applications in healthcare. They present a
Python package that uses Shapley additive values to simplify feature selection for a
variety of healthcare prediction tasks. As an illustrative example, Zoish was applied to a predictive
model of Parkinson's progression as measured by the Montreal Cognitive Assessment (MoCA)
and showed not only greater predictive performance overall but also improved interpretability
compared to another feature selection method. As AI models attempt to move away from the
“black box”, tools such as Zoish can help clinicians better understand how the models produce
predictions.


2.4. Human-machine interface 

Moore et al. (2024) proposed SynTwin, a graph-based approach for predicting clinical outcomes
using digital twins derived from synthetic patients. SynTwin introduces a novel methodology for
generating and utilizing digital twins for clinical outcome prediction in precision medicine. The
approach begins by estimating the distance between subjects based on their features, and then uses
these distances to construct a network. Communities of subjects are defined, and a population of
synthetic patients is generated. Digital twins, selected from this synthetic patient population, are
used to enhance the prediction of clinical endpoints. When applied to a population-based cancer
registry, the SynTwin approach significantly improved the prediction of mortality compared to
using real data alone, demonstrating the potential of this method in advancing precision medicine
efforts. Patel et al. (2024) proposed optimizing computer-aided diagnosis with cost-aware deep
learning models. They propose a deep learning computer-aided diagnosis system to address the
common situation in healthcare in which a false negative is more serious than a false positive.
Whereas traditional computer-aided diagnosis systems penalize both types of misclassification
equally, the cost-aware neural net model described here shows how using cost as a hyperparameter
can boost sensitivity while largely maintaining overall accuracy.
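One way to picture a cost-aware objective is a weighted binary cross-entropy in which errors on positive cases are scaled by a cost factor. The sketch below is illustrative only and is not the model of Patel et al.; the function name `cost_weighted_bce` and the parameter `fn_cost` are assumptions introduced here.

```python
import math

def cost_weighted_bce(y_true, p_pred, fn_cost=1.0, eps=1e-12):
    """Mean binary cross-entropy with the positive-class term scaled by fn_cost.

    fn_cost is a hypothetical hyperparameter: values above 1 make a missed
    positive (false negative) more expensive than a false alarm.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += -(fn_cost * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0]
missed_positive = [0.1, 0.1]  # the positive case is scored low
false_alarm = [0.9, 0.9]      # the negative case is scored high

# With fn_cost=1 both mistakes cost the same; with fn_cost=5 the missed
# positive dominates the loss.
print(cost_weighted_bce(labels, missed_positive, fn_cost=5.0))
print(cost_weighted_bce(labels, false_alarm, fn_cost=5.0))
```

Treating `fn_cost` as a tunable value lets a practitioner trade specificity for sensitivity on a validation set.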


3. Conclusion 

Submissions accepted at the Artificial Intelligence in Clinical Medicine: Generative and 
Interactive Systems at the Human-Machine Interface session underscore the expanding role of AI 
in clinical medicine. The array of studies, spanning from advancements in AI-driven medical text
and clinical notes analysis to breakthroughs in medical image processing, neurobiology, and 
human-machine interfaces, highlights the potential of generative and classical AI to improve 
healthcare. The consistent theme across all submissions is the emphasis on practical, real-world 
applications, showing AI’s capability to enhance diagnostic accuracy, monitor cognitive 
symptoms, analyze diverse data types, and augment clinical decision-making processes. Despite 
these advancements, the need for identifying clinical problems and ongoing evaluation and 
assessment of AI technologies in healthcare to ensure their safety, efficacy, and reliability 
remains paramount. The works presented herein contribute significantly to this ongoing
dialogue, showcasing both the possibilities and the remaining challenges in integrating AI into
the healthcare landscape.


The quickly expanding nature of published medical literature makes it challenging for clinicians
and researchers to keep up with and summarize recent, relevant findings in a timely
manner. While several closed-source summarization tools based on large language models
(LLMs) now exist, rigorous and systematic evaluations of their outputs are lacking. Furthermore,
there is a paucity of high-quality datasets and appropriate benchmark tasks with
which to evaluate these tools. We address these issues with four contributions: we release
Clinfo.ai, an open-source WebApp that answers clinical questions based on dynamically
retrieved scientific literature; we specify an information retrieval and abstractive summarization
task to evaluate the performance of such retrieval-augmented LLM systems; we
release a dataset of 200 questions and corresponding answers derived from published systematic
reviews, which we name PubMed Retrieval and Synthesis (PubMedRS-200); and
we report benchmark results for Clinfo.ai and other publicly available OpenQA systems on
PubMedRS-200.


Keywords: Large Language Models, Abstractive Summarization, Artificial Intelligence, Clinical
Medicine, Generative AI, Interactive Systems, ChatGPT


1. Introduction 


The aggregation and distribution of medical knowledge, facilitated by platforms such as
PubMed or Cochrane, enables healthcare professionals and medical researchers to stay abreast
of the latest scientific discoveries and make informed decisions based on up-to-date scientific
evidence.¹ However, the staggering influx of more than 1 million papers each year into PubMed
alone (equivalent to two papers per minute as of 2016)² highlights the daunting task of keeping
up with scientific findings.³ This is especially true for practicing clinicians, who face the
challenge of keeping track of the most updated research findings in all areas related to their
patient care duties.⁴
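As a quick sanity check on the quoted rate, one million papers per year can be converted to papers per minute. The snippet is a back-of-the-envelope calculation assuming a 365-day year; the function name is introduced here for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

def papers_per_minute(papers_per_year: float) -> float:
    """Convert a yearly publication count into an average per-minute rate."""
    return papers_per_year / MINUTES_PER_YEAR

rate = papers_per_minute(1_000_000)
print(f"{rate:.2f} papers per minute")  # prints "1.90 papers per minute"
```

The result, roughly 1.9, is consistent with the "two papers per minute" figure in the text.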

Existing technologies fail to adequately satisfy the information needs of health care professionals
and researchers. In daily practice, clinicians have on average one care-related question
for every other patient seen⁵ and they refer to sources like PubMed or UpToDate to obtain
summarized information answering these questions. Questions that cannot be answered
within 2 to 3 minutes are often abandoned, potentially negatively impacting patient care
and outcomes. While systematic review (SR) articles can provide quick answers to clinical
questions, many questions are not answerable through existing reviews. On the other hand,
manually synthesizing findings from multiple primary sources without the help of a published
review article can be extraordinarily time consuming. Review articles take on average 67.3
weeks to complete, and those written reviews may not even include the most updated research
published in the literature. Question-answering tools that leverage frequently updated
external electronic resources would enable researchers and clinicians to obtain up-to-date
information in a more efficient way that benefits scientific discovery and quality of patient
care.

In previous decades, applications that integrated clinical systems with on-line information
to answer users’ information needs (e.g., “infobuttons”)¹⁴ were typically driven by semantic
networks. Other works such as CHiQA proposed a combination of knowledge-based, machine
learning, and deep learning approaches to develop a question-answering system using patient-oriented
resources to answer consumer health questions.¹⁵

The new capabilities of agents powered by large language models (LLMs) have accelerated
the development of automated literature summarization tools. Most of these tools
are privately developed, closed-source solutions based on retrieval-augmented¹⁶ (RetA)
LLMs¹⁷ (e.g., Scite,¹⁸ Elicit,¹⁹ Glacier MD,²⁰ Consensus,²¹ OpenEvidence,²² StatPearls semantic
search²³). However, the paucity of publicly available technical reports describing these
systems and the lack of appropriate guidelines, regulations, and evaluations to ensure their
safe and responsible usage are an urgent concern.²⁴

This Natural Language Generation (NLG) problem has been exacerbated by a lack of (1) 
representative datasets and associated tasks, and (2) automated metrics for evaluating RetA 
LLMs on said tasks. 

Fortunately, developments in the LLM evaluation space have shown that a number of automated
metrics correlate moderately with human preference, even in domain-specific scenarios
(including medicine).²⁵⁻²⁷

Building on these advancements, we provide four contributions: 


(1) Clinfo.ai,ᵃ the first publicly available, open-source, end-to-end retrieval-augmented
LLM-based system for querying and synthesizing the clinical literature. The system is hosted
as a publicly available WebApp at

(2) An open information retrieval and abstractive summarization task specification designed
to evaluate an algorithm’s ability to both retrieve relevant information and adequately
synthesize it. In the task setup, both the information retrieval and abstractive summarization
sub-tasks are compared to gold standard (human generated but pragmatically retrieved)
references and answers. Furthermore, our task is defined to truly resemble RetA deployment
conditions (enabling the evaluation of already deployed but potentially closed-source
systems).

(3) PubMed Retrieval and Synthesis (PubMedRS-200), a publicly available dataset of 200
questions structured in Open QA format, paired with answers derived from systematic
reviews and corresponding references.

(4) Benchmark results for Clinfo.ai and other publicly available OpenQA systems on
PubMedRS-200.


ᵃ https://github.com/som-shahlab/Clinfo.AI
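To make the retrieval half of the task in item (2) concrete, the sketch below scores a system's retrieved article identifiers against the gold-standard references of a systematic review. Precision and recall are used here as an assumption for illustration; this section does not state which metrics the benchmark reports, and the function name and identifiers are hypothetical.

```python
def retrieval_scores(retrieved_ids, gold_ids):
    """Return (precision, recall) of retrieved article IDs against gold references."""
    retrieved, gold = set(retrieved_ids), set(gold_ids)
    hits = len(retrieved & gold)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

# Placeholder identifiers standing in for PubMed IDs.
system_output = ["id_a", "id_b", "id_c", "id_d"]
review_references = ["id_b", "id_d", "id_e"]

precision, recall = retrieval_scores(system_output, review_references)
print(precision, recall)
```

The summarization half would be scored separately by comparing the generated answer with the gold answer text.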


2. Related Work 


LLMs in healthcare. The remarkable performance of LLMs in the general domain has
brought about a revolution in the field of natural language processing,²⁸ showcasing exceptional
capabilities in tasks like summarization, question-answering, and NLG.²⁹ Given their
wide utility, researchers are now actively exploring applications of LLMs in healthcare.³⁰⁻³³
Several LLMs have achieved human-level performance on numerous medical professional licensing
exams such as the United States Medical Licensing Exam (USMLE).³⁴ Other works
have demonstrated promise in various healthcare-inspired tasks, such as automated clinical
note generation and reasoning about public health topics. However, NLG tasks and publicly
available benchmarks that directly address true medical needs are still underrepresented
in the literature. Such tasks and benchmarks are especially important for estimating the
capabilities and risks of LLMs in the clinical domain.

LLMs have several documented disadvantages and risks. First, updating LLMs with new
knowledge and information is challenging and inefficient. Second, the training objective of
LLMs to predict the most probable next token can cause these models to generate inaccurate
information (hallucination), requiring costly and imperfect post-hoc model adjustments
like reinforcement learning from human feedback (RLHF). More importantly, most popular
consumer-facing LLMs (e.g., OpenAI’s GPT-4,²⁹ Meta’s Llama 2,³⁷ Anthropic’s Claude 2³⁸)
do not provide references pointing to their source of information, even when the model’s output
is factual. This can engender distrust among users in many scientific domains, including
healthcare. Prior work has proposed RetA LLMs¹⁶ to solve the information provenance issue
and has shown promising results. These RetA LLMs do not require post-hoc model editing
in order to incorporate new knowledge.

Retrieval Augmentation Question Answering LLMs in Medicine. Hiesinger et al.³⁹
introduced Almanac, a novel LLM integrated with a vector database and calculator, designed
to answer 130 clinical questions generated by a panel of five board-certified clinicians and
resident physicians. The results showed that Almanac surpassed a standard LLM (GPT-4) in
factuality, safety, and correctness, indicating that retrieval systems lead to more accurate and
reliable responses to clinical inquiries. Soong et al.⁴⁰ evaluated GPT-3.5 and GPT-4 models
against a custom RetA LLM using a set of 19 questions. The evaluation, based solely on
human judgments, revealed that both GPT-3.5 and GPT-4 exhibited more hallucinations in
all 19 responses compared to the RetA model. While these works on RetA LLM systems
represent significant progress, they suffer from at least two shortcomings: (1) they typically
require human evaluation, making systematic benchmarking of new systems challenging and
unscalable; (2) they often focus solely on evaluating an LLM’s output, disregarding the
relevance of the information retrieved to generate an answer. Deciding which “relevant” sources
should be summarized can be just as challenging as generating the actual summary. Hence
there is a need for a benchmark that enables integrated evaluation of both a system’s ability
to select relevant documents and its ability to summarize these documents.


3. Materials and Methods 


3.1. Dataset Generation 


[Figure 1 image not reproduced; recoverable labels: “Filter by Title”; abstract sections Title, Introduction, Conclusion, References]

Fig. 1: Schematic Representation of the Protocol for Retrieving Abstracts from PubMed and
Generating Title-Based Questions


PubMed is a free resource supporting search and retrieval of biomedical literature. As
prior work has demonstrated, a large number of the research papers available in this index are
phrased as questions, and it is possible to structure them in a question-answer format.⁴¹,⁴²
Extending this idea, we created an open information retrieval and abstractive summarization
dataset, using SRs as a proxy for inquiries of medical interest. The rationale is that SRs are
structured reviews written by human experts which summarize the pertinent literature related
to a question of interest in an evidence-based manner.⁴³ In writing an SR, experienced authors
(1) screen the published literature in a systematic way and include studies in a standardized
manner; (2) critically evaluate methodology and reported outcomes of the included studies;
and (3) carefully extract data, summarize original research findings, and in some instances,
conduct additional statistical analysis of extracted results from studies including randomized
controlled trials, observational cohort studies, case series and other qualitative studies on a
specific topic. Furthermore, SRs are extensively used to provide evidence for various purposes,
including policy-making, clinical practice guidelines, health technology assessment, and decision
making in healthcare.⁴⁴ As SRs unify and present a comprehensive overview of a given
subject by human experts, we chose to leverage published SRs as gold standards when building
our database.




To populate such a dataset, we employed E-utilities, a public API to the NCBI Entrez system,⁴⁵
to access PubMed and construct question-answer pairs with their respective references.
Figure 1 illustrates our process in detail. First, we established a comprehensive selection of
medical specialties and subspecialties. Second, we formulated a query to retrieve systematic
reviews relevant to each medical specialty/subspecialty. Upon constructing the specialty-specific
queries and retrieving associated abstracts, we retrieved all papers structured in a
format that can be easily converted to question-answer pairs (as noted by Jin et al. 2019⁴¹),
namely Title, Introduction, Conclusion, and References. Third, we applied another filtering
step, narrowing down to solely those publications whose titles included an explicit question
(i.e., publications whose titles include question marks). The questions from these titles were
extracted.
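The retrieval and title-filtering steps described above can be sketched as follows. This is a minimal illustration and not the authors' script: the search term, including the `systematic[sb]` subset filter, is an assumed example of a specialty-specific query, and the helper names are hypothetical. Only the E-utilities `esearch` endpoint and its `db`, `term`, `retmax`, and `retmode` parameters come from the public NCBI interface. The code builds the request URL without sending it.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(specialty: str, retmax: int = 100) -> str:
    """Build an E-utilities esearch URL for systematic reviews in one specialty.

    The term below is an assumed query shape, not the one used in the paper.
    """
    term = f"({specialty}) AND systematic[sb]"
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"

def question_titles(titles):
    """Keep only titles that contain an explicit question mark."""
    return [t.strip() for t in titles if "?" in t]

print(build_esearch_url("cardiology", retmax=20))
print(question_titles(["Does aspirin prevent stroke?", "A review of aspirin"]))
```

The returned identifiers would then be passed to a fetch step to obtain the abstracts, which are reviewed manually as described in the next paragraph.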

Finally, two human evaluators (AL and SF) manually reviewed the retrieved questions 
and extracted an answer to each question using minimally modified text from the results and 
conclusions section of the corresponding SR abstract. Concretely, in order to generate each 
answer, the human reviewers removed from the Results and Conclusions section of the abstract 
any text describing the structure or design of the systematic review (e.g., “We used PubMed 
to retrieve 100 papers”), leaving only text that directly addressed the question extracted from 
the SR’s title. In the process, abstracts that were lacking substantive results and abstracts that 
merely described research proposals (e.g. descriptions of future work) were entirely removed. 


3.2. Clinfo.ai: An LLM Chain for Information Retrieval and Synthesis 


Our proposed RetA LLM system, Clinfo.ai, consists of a collection of four LLMs working
conjointly (an LLM chain⁴⁶) coupled to a Search Index (either PubMed or Semantic Scholar)
as depicted in Figure 2. Previous works have observed that very large language models (e.g.,
100B parameters or more) exhibit zero-shot reasoning capabilities, where task-specification
prompts can be used to guide the LLM output without further fine-tuning.⁴⁷,⁴⁸ We leverage
the zero-shot reasoning capabilities of two LLMs, specifically OpenAI’s GPT-3.5 and GPT-4
models, to complete each step in the LLM chain depicted in Figure 2. All prompts used in
each step of the chain are available in the supplemental material.ᵇ We use LangChain’s API
to send prompts and receive outputs from GPT-3.5 and GPT-4. While different models could
technically be used through this entry point, our experiments are limited to OpenAI’s GPT-3.5
and GPT-4 models (snapshots gpt-3.5-turbo-0613 and gpt-4-0613, respectively). For both models,
we employ a temperature of 0.5 and a maximum generation limit of 1024 tokens.


3.2.1. Query Generator 


In our Clinfo.ai system, the input is the question submitted by the user. Once a question is 
submitted, the primary task of the query generator (labeled “Question2Query” in Figure 2) 
is to construct a PubMed (or Semantic Scholar) query that efficiently retrieves a substantial 
number of relevant articles pertaining to the posed question. This is achieved by instructing 
the model to incorporate the most crucial and relevant keywords that accurately represent 
the query’s context and requirements. 


[Figure 2 diagram: Question → Question2Query → search index (PubMed or Semantic Scholar) → 
Relevance Classifier → Relevant Articles → Summarization → Synthesis → Answer + Citations] 
Fig. 2: Clinfo.ai: A RetA LLM system for retrieving and summarizing scientific articles 


Queries Used: 

(high-grade dysplasia OR carcinoma in situ) AND biliary duct margin AND extrahepatic cholangiocarcinoma AND prognosis 
(high-grade dysplasia OR carcinoma in situ) AND (biliary duct margin) AND (extrahepatic cholangiocarcinoma) AND prognosis 


Fig. 3: Queries generated by Clinfo.ai for the question: “Does high-grade dysplasia/carcinoma in 
situ of the biliary duct margin affect the prognosis of extrahepatic cholangiocarcinoma?” 


3.2.2. Information Retriever 


In a similar fashion to the Dataset Generation process, we utilize the Entrez API to fetch 
abstracts from PubMed using the output generated by the Query Generator. By leveraging 
the Entrez API, we are able to programmatically access and retrieve the relevant abstracts 
that match the constructed PubMed queries. Because LLM output is stochastic and different 
queries may capture different aspects of the literature, we take the union of all papers returned 
by three LLM-generated queries (each with the same prompt but different seeds). 
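
A minimal sketch of the union step follows. The `search` callable stands in for a real search request (for example, an Entrez query), and `fake_search` with its article IDs is hypothetical data used only to keep the example self-contained.

```python
from typing import Callable, Iterable


def union_of_results(
    queries: Iterable[str],
    search: Callable[[str], list[str]],
) -> list[str]:
    """Run each query and return the de-duplicated union of article IDs.

    The order of first appearance is preserved so results are reproducible.
    """
    seen: set[str] = set()
    merged: list[str] = []
    for query in queries:
        for article_id in search(query):
            if article_id not in seen:
                seen.add(article_id)
                merged.append(article_id)
    return merged


# Hypothetical index standing in for a live search service.
_FAKE_INDEX = {
    "query A": ["111", "222"],
    "query B": ["222", "333"],
    "query C": [],
}


def fake_search(query: str) -> list[str]:
    """Return canned IDs for a query; unknown queries return no results."""
    return _FAKE_INDEX.get(query, [])
```

Passing the search function as an argument keeps the merging logic independent of the search index, which matches the system's support for more than one index.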


3.2.3. Relevance Classifier 


Since the query generator emphasizes recall over precision (i.e., it retrieves as many potentially 
relevant articles as possible), it is crucial to classify the relevancy of the retrieved articles. To 
achieve this, we adopt an LLM-enabled binary classification approach, wherein each article 
is categorized as either relevant or not relevant to the posed question using GPT-3.5. Once 
the relevant articles are identified, we make use of the full abstract metadata of each article 
to construct their citations in the IEEE format. If more than 35 articles are deemed 
relevant, the user can decide to re-rank and filter them using BM25.49 
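
The optional re-ranking step can be sketched with a small Okapi BM25 scorer. The whitespace tokenizer, the parameter values `k1 = 1.5` and `b = 0.75`, and the function names are assumptions chosen for illustration; only the threshold of 35 articles comes from the text.

```python
import math
from collections import Counter

MAX_ARTICLES = 35  # threshold stated in the text


def tokenize(text: str) -> list[str]:
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs if n_docs else 0.0
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rerank_if_needed(question: str, abstracts: list[str], limit: int = MAX_ARTICLES) -> list[str]:
    """Keep everything when at most `limit` abstracts; otherwise keep the top `limit` by BM25."""
    if len(abstracts) <= limit:
        return list(abstracts)
    scores = bm25_scores(question, abstracts)
    ranked = sorted(range(len(abstracts)), key=lambda i: scores[i], reverse=True)
    return [abstracts[i] for i in ranked[:limit]]
```

In the system described above the user decides whether to re-rank; this sketch applies the cut automatically only to keep the example short.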


3.2.4. Summarization 


The penultimate step in Clinfo.ai uses an LLM to summarize each relevant abstract within 
the context of the user-submitted question. 


3.2.5. Synthesis 


In the final step of Clinfo.ai, the relevant article summaries are organized as an ordered 
list, with each number in the list corresponding to a citation. This structured list of article 
summaries is then fed to an LLM with the task of constructing a concise and informative 
summary. The LLM is also instructed to use only the provided article summaries and 
no additional information, relying on the structured list of citations to reference and 
accurately attribute each finding. 
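
As an illustration of how the numbered list could be assembled before it is passed to the LLM, consider the sketch below. The function name and the instruction wording are assumptions; the actual prompts are those in the supplemental material.

```python
def build_synthesis_input(question: str, summaries: list[str]) -> str:
    """Format article summaries as a numbered list for the synthesis step.

    Each item's number doubles as its citation marker, e.g. [1].
    The instruction wording is illustrative, not the prompt used in the paper.
    """
    numbered = "\n".join(f"[{i}] {s}" for i, s in enumerate(summaries, start=1))
    return (
        f"Question: {question}\n\n"
        f"Article summaries:\n{numbered}\n\n"
        "Using only the summaries above, write a concise answer and cite "
        "each finding with its bracketed number."
    )
```

Because the list numbers and the citation numbers are the same, every bracketed marker in the generated text can be traced back to exactly one article summary.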


3.3. www.clinfo.ai: A Clinfo.ai User Interface via Web Application 


[Figure 4 screenshot: the Clinfo.ai web interface, showing a Search Engine selector (PubMed), a 
Prompt Architecture selector (Automatic), the banner “Answer Medical Questions from Scientific 
Literature. I'll find the articles needed to answer your question. Then I'll summarize the 
results.”, and a question box (e.g., “Does Paxlovid increase risk COVID-19 symptom rebound?”)] 
Fig. 4: Clinfo.ai user interface 


To facilitate interaction with our system, we developed a web application that allows users 
to submit their own questions and/or customize the prompts. The latter enables users to 
tailor the system according to their individual preferences and needs, as illustrated in Figure 
4. The entire process provides real-time access, displaying the queries generated during the 
search (as shown in Figure 3), the number of retrieved articles, a concise summary of each 
important article, and a final “Literature Summary” (or “Synthesis”, to distinguish it from 
the individual article summaries) accompanied by an abbreviated answer to the question 
(“TL;DR”). Additionally, the references are presented as hyperlinks, enabling users to verify 
both the validity of the reference and the information captured from it. It is possible that even 
after summarizing an article’s abstract, Clinfo.ai may not include that article in the final Literature 
Summary or “TL;DR”. Nevertheless, we ensure that all relevant articles are presented to the 
user so that they can access and explore them as needed. An example of a final Literature 
Summary constructed with Clinfo.ai is shown in Figure 5. 


3.4. Task Description and Evaluation 


The task is defined in three steps: 

(1) Given a question, generate a query to retrieve a set of articles; 
(2) Given the provided articles, determine their relevancy to the question; 
(3) Given relevant articles, summarize the findings. 


[Figure 5 content] 
Does high-grade dysplasia/carcinoma in situ of the biliary duct margin affect the prognosis of extrahepatic cholangiocarcinoma? 

Literature Summary: The impact of high-grade dysplasia/carcinoma in situ (HGD/CIS) at the biliary duct margin on the prognosis of extrahepatic 
cholangiocarcinoma is a subject of debate. Studies [1,2,10,13] have suggested that HGD/CIS at the biliary duct margin can have an impact on survival, 
with some studies indicating a poorer prognosis [1,11,13], while others suggest comparable outcomes to negative margins [2,10]. Furthermore, a 
meta-analysis [8] found no significant impact of HGD/CIS on overall survival rates compared to negative margins, but a higher incidence of local 
recurrence. Yet, other studies [9,12,14] found no significant impact of HGD/CIS on survival rates. Importantly, the presence of invasive carcinoma at 
the ductal resection margins has been identified as a strong independent prognostic factor [2,10]. Additional resection has been suggested to 
improve outcomes in some cases [1,4], although one study found worse survival rates even after achieving negative margins through additional 
resection [14]. 

TL;DR: The evidence is mixed, but it suggests that high-grade dysplasia/carcinoma in situ at the biliary duct margin may impact the prognosis of 
extrahepatic cholangiocarcinoma, potentially leading to a poorer prognosis or higher incidence of local recurrence. Invasive carcinoma at the margins 
is a stronger negative prognostic factor. Additional resection may improve outcomes in some cases. 

References: 
[1] Radtke A, Konigsrainer A (2016) Surgical therapy of cholangiocarcinoma. Visc Med 32:422-426 
[2] Nagorney DM, Donohue JH, Farnell MB, et al. (1993) Outcomes after curative resections of cholangiocarcinoma. Arch Surg 128:871-879 
[3] Noji T, Okamura K, Tanaka K, Nakanishi Y, Asano T, Nakamura T, Tsuchikawa T, Hirano S. Surgical technique and results of intrapancreatic bile 
duct resection for hilar malignancy (with video). HPB: the official journal of the International Hepato Pancreato Biliary Association. 
2018;20(12):1145-1149. 
[4] Otsuka S, Ebata T, Yokoyama Y, Mizuno T, Tsukahara T, Shimoyama Y, Ando M, Nagino M. Clinical value of additional resection of a margin- 

Fig. 5: “Literature Summary” (Synthesis) and “TL;DR” constructed with Clinfo.ai for the 
question, “Does high-grade dysplasia/carcinoma in situ of the biliary duct margin affect the 
prognosis of extrahepatic cholangiocarcinoma?” (not all references are included in figure) 


Step (2) is evaluated based on precision and recall. Considering the set of all documents 
D, RET(D,k) denotes the set of k retrieved documents deemed relevant and REL(D, q) the set 
of all documents referenced by a SR. We define precision and recall in this context as follows: 


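$$\mathrm{precision} = \frac{|\mathrm{RET}(D,k) \cap \mathrm{REL}(D,q)|}{|\mathrm{RET}(D,k)|} \tag{1}$$

$$\mathrm{recall} = \frac{|\mathrm{RET}(D,k) \cap \mathrm{REL}(D,q)|}{|\mathrm{REL}(D,q)|} \tag{2}$$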
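
The two definitions translate directly into set operations. In the sketch below, the function name, the use of string identifiers, and the convention of returning 0.0 for an empty denominator are assumptions for illustration.

```python
def precision_recall(retrieved: set[str], referenced: set[str]) -> tuple[float, float]:
    """Compute precision and recall for one question.

    retrieved  -- IDs of the k documents retrieved and deemed relevant, RET(D, k)
    referenced -- IDs of all documents referenced by the source SR, REL(D, q)
    An empty denominator yields 0.0 (an assumed convention).
    """
    overlap = len(retrieved & referenced)
    precision = overlap / len(retrieved) if retrieved else 0.0
    recall = overlap / len(referenced) if referenced else 0.0
    return precision, recall


# Worked example with hypothetical IDs: 2 of 4 retrieved documents are among
# the 5 referenced ones, so precision = 2/4 and recall = 2/5.
example = precision_recall({"a", "b", "c", "d"}, {"b", "d", "e", "f", "g"})
```

Both quantities share the same numerator and differ only in the denominator, which is why a system that retrieves few documents can score high precision and low recall at once.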


Step (3) is evaluated using both source-free (SF) and source-augmented (SA) automated 
metrics. Source-free metrics compare a model’s output to a gold standard reference summary, 
without including any information from the articles used to generate the gold standard summary. 
For our evaluation purposes, the gold standard is the human-curated answer (derived 
from conclusions and/or results of each SR). On the other hand, SA metrics additionally 
consider relevant context to evaluate the quality of model-generated outputs. For our experiments, 
context is constructed by concatenating a SR’s introduction, results, and conclusion 
sections. The SA metrics we employed (and the LMs they use) include UniEval (T5-large), 
COMET (XLM-RoBERTa), and CTC Summary Consistency (BERT). 

UniEval is a multi-dimensional evaluator designed for summarization tasks and takes into 
account four key dimensions (and their corresponding overall average): 


• Coherence: Assesses whether the summary forms a cohesive and rational body of text; 

• Consistency: Evaluates the factual alignment between the information presented in the 
summary and the content of the source document; 

• Fluency: Assesses the readability and linguistic fluency of a summary; 

• Relevance: Measures whether the summary contains only the important information 
from the source document. 


COMET is an evaluation metric developed to assess the quality of Machine Translation 
(MT) systems. Despite being trained on multilingual MT outputs, it performs remarkably 
well in monolingual settings when predicting summarization output quality. CTC is an 
evaluation framework, based on information alignment between input, output, and context, for 
compression (e.g., summarization), transduction (e.g., translation), and creation (e.g., conversation). 

Finally, we perform an evaluation using SF metrics, including BERTScore, ROUGE-L, 
METEOR, chrF, GoogleBLEU, CTC Summary (without providing context), and 
CharacTer. The majority of these metrics have shown moderate correlation with human 
preference and are widely reported in NLG tasks. 

The multi-dimensional evaluation based on source-augmented metrics makes the assump- 
tion that an LLM+RetA model is able to (1) retrieve abstracts of works that were deemed 
relevant by an author of a SR and (2) synthesize them in a similar fashion. We acknowledge 
that if this assumption is not met, the evaluation would heavily penalize the output. Con- 
versely, if the system retrieves an article that was not considered by a SR but bears a similar 
semantic meaning to an article present in the references of a SR, the evaluation would not 
penalize the generated text. For our proposed method, both behaviors are desired. 


4. Baselines and Experiments 


[Figure 6 plot: histograms of the UniEval overall score (x-axis, “overall”, 0 to 1) versus frequency 
(y-axis), one panel per system; legible labels include Statpearls SS, GPT3.5, GPT4, Clinfo Synthesis, 
and Clinfo TL;DR] 
Fig. 6: Distribution of UniEval Overall scores for 146 questions (unconstrained by publication date) 
from PubMedRS-200 under Unrestricted Search (GPT-3.5 and GPT-4 zero-shot performance 
is added) 


Using our proposed task, we evaluated the performance of GPT-4 and GPT-3.5 without 
retrieval augmentation, Clinfo.ai (our GPT-enabled RetA LLM system), and two deployed 
tools: Elicit (an AI research assistant based on LLMs, designed for facilitating literature review 
generation, accessed on 07-02-2023), and Statpearls Semantic Search (a free search tool for 
medical knowledge, accessed on 07-25-2023). While other automated literature summarization 
systems are available, at the time of this study the vast majority require a subscription to 
answer multiple questions. Additionally, a subset of these systems refused to provide an answer 
to a significant number of the PubMedRS-200 questions as posed, making evaluation for these 
systems fraught and difficult to interpret. We exclude these systems from our analysis. 
Table 1: Performance on 146 questions from PubMedRS-200 using source-augmented (SA) 
metrics: UniEval (T5-large), COMET (XLM-RoBERTa), CTC summary (BERT) 


                          Unified Multi-Dimensional Evaluator (UniEval)                                                   CTC (SA)
Model                     Coherence ↑     Consistency ↑   Fluency ↑       Relevance ↑     Overall ↑       COMET ↑         Consistency ↑   Avg. Length
LLM
  GPT-3.5                 0.908 (0.149)   0.694 (0.144)   0.947 (0.059)   0.939 (0.101)   0.872 (0.082)   0.676 (0.075)   0.865 (0.017)   104.834 (47.778)
  GPT-4                   0.915 (0.099)   0.655 (0.145)   0.942 (0.051)   0.929 (0.078)   0.86 (0.062)    0.677 (0.075)   0.866 (0.017)   84.214 (39.772)
LLM + RetA
  Restricted Search
    Synthesis & TL;DR     0.949 (0.065)   0.466 (0.105)   0.903 (0.104)   0.964 (0.053)   0.82 (0.055)    0.704 (0.055)   0.84 (0.014)    205.579 (46.181)
    Synthesis             0.925 (0.066)   0.394 (0.11)    0.893 (0.119)   0.939 (0.101)   0.788 (0.059)   0.693 (0.057)   0.842 (0.015)   165.814 (40.749)
    TL;DR                 0.866 (0.143)   0.787 (0.161)   0.954 (0.018)   0.826 (0.159)   0.858 (0.098)   0.665 (0.078)   0.874 (0.018)   38.766 (11.682)
  Source Dropped
    Synthesis & TL;DR     0.942 (0.092)   0.465 (0.104)   0.918 (0.085)   0.962 (0.059)   0.822 (0.055)   0.706 (0.056)   0.843 (0.014)   204.248 (38.394)
    Synthesis             0.925 (0.066)   0.398 (0.112)   0.912 (0.096)   0.943 (0.055)   0.795 (0.055)   0.695 (0.059)   0.845 (0.016)   164.938 (33.221)
    TL;DR                 0.829 (0.202)   0.763 (0.197)   0.953 (0.029)   0.796 (0.194)   0.835 (0.13)    0.672 (0.078)   0.876 (0.017)   38.31 (10.726)
  Unrestricted Search
    Our Models
      Synthesis & TL;DR   0.945 (0.064)   0.539 (0.127)   0.912 (0.096)   0.962 (0.059)   0.84 (0.052)    0.721 (0.055)   0.852 (0.017)   214.338 (44.173)
      Synthesis           0.916 (0.092)   0.48 (0.142)    0.904 (0.098)   0.935 (0.069)   0.809 (0.06)    0.712 (0.057)   0.855 (0.019)   173.379 (38.492)
      TL;DR               0.896 (0.123)   0.81 (0.159)    0.955 (0.012)   0.857 (0.135)   0.88 (0.081)    0.681 (0.072)   0.88 (0.016)    39.959 (11.754)
    Deployed Models
      Elicit              0.854 (0.136)   0.352 (0.147)   0.743 (0.151)   0.902 (0.117)   0.713 (0.085)   0.7 (0.066)     0.866 (0.017)   130.566 (22.946)
      Statpearls SS       0.753 (0.225)   0.383 (0.129)   0.93 (0.053)    0.845 (0.159)   0.728 (0.112)   0.633 (0.075)   0.841 (0.016)   118.172 (26.603)

Lastly, since our framework generates two outputs, “TL;DR” and “Literature Summary” 
(also referred to as “Synthesis”), we conducted evaluations of three forms of Clinfo.ai’s output: 
(1) the synthesis of the articles retrieved and deemed relevant (“Synthesis”); (2) the abbreviated 
summary distilling the proposed “Synthesis” into one or two sentences (“TL;DR”); 
(3) the combined “Synthesis” and “TL;DR”. 

We recognize that the usage of scientific literature to extract question-answer pairs comes 
with the possibility that an answer deemed correct at the time of acquisition may be incor- 
rect as new discoveries are published. To ensure that a system is not rewarded for simply 
copy-pasting the text of a retrieved source SR nor penalized when new relevant articles are 
published, we consider three evaluation regimes: 


(1) Restricted Search (RS): The retrieval process is constrained to include publications 
up to one day before the publication date of the source SR. While this approach may not guarantee the 
retrieval of all publications considered important by the authors of each source systematic 
review, it effectively narrows down the search space to the subset of publications that 
could have been retrieved and deemed relevant during the review’s preparation. 

(2) Source Dropped (SD): The retrieval process can retrieve articles published both before 
and after the source systematic review. However, if the source SR is retrieved, it is removed 
from the set of relevant articles and not used in the subsequent steps of the summarization 
process. 

(3) Unrestricted Search (US): No restriction is applied; the source SR may (but need not) 
be included in the set of relevant articles retrieved by the system. Because we could not 
control the set of articles retrieved and summarized by closed-source tools like Elicit and 
Statpearls SS, they effectively fall within this evaluation regime. 


Table 2: Performance on 146 questions from PubMedRS-200 using source-free (SF) metrics 


Model                     BERTScore ↑     ROUGE-L ↑       METEOR ↑        chrF ↑          GoogleBLEU ↑    CTC (SF) ↑      CharacTer ↓     Avg. Length
LLM
  GPT-3.5                 0.781 (0.037)   0.165 (0.053)   0.181 (0.073)   30.2 (10.5)     0.077 (0.036)   0.575 (0.065)   0.912 (0.102)   104.834 (47.778)
  GPT-4                   0.78 (0.037)    0.157 (0.049)   0.192 (0.07)    31.6 (9.06)     0.074 (0.031)   0.571 (0.064)   0.89 (0.099)    84.214 (39.772)
LLM + RetA
  Restricted Search
    Synthesis & TL;DR     0.77 (0.028)    0.135 (0.043)   0.121 (0.055)   21.5 (9.98)     0.058 (0.03)    0.527 (0.059)   0.993 (0.029)   205.579 (46.181)
    Synthesis             0.773 (0.028)   0.141 (0.044)   0.133 (0.059)   24.3 (10.4)     0.063 (0.032)   0.533 (0.06)    0.976 (0.056)   165.814 (40.749)
    TL;DR                 0.784 (0.041)   0.145 (0.068)   0.221 (0.089)   32.7 (7.67)     0.061 (0.043)   0.594 (0.068)   0.833 (0.086)   38.766 (11.682)
  Source Dropped
    Synthesis & TL;DR     0.773 (0.028)   0.136 (0.037)   0.119 (0.054)   21.4 (9.69)     0.057 (0.028)   0.53 (0.06)     0.989 (0.036)   204.248 (38.394)
    Synthesis             0.775 (0.026)   0.143 (0.038)   0.132 (0.057)   24.1 (9.91)     0.061 (0.043)   0.536 (0.06)    0.976 (0.056)   164.938 (33.221)
    TL;DR                 0.787 (0.041)   0.148 (0.064)   0.218 (0.078)   33 (6.98)       0.06 (0.039)    0.6 (0.066)     0.83 (0.092)    38.31 (10.726)
  Unrestricted Search
    Our Models
      Synthesis & TL;DR   0.786 (0.029)   0.167 (0.06)    0.145 (0.073)   23.5 (11.2)     0.079 (0.046)   0.546 (0.067)   0.989 (0.036)   214.338 (44.173)
      Synthesis           0.789 (0.03)    0.178 (0.067)   0.164 (0.084)   26.7 (12)       0.088 (0.051)   0.555 (0.07)    0.975 (0.065)   173.379 (38.492)
      TL;DR               0.793 (0.038)   0.169 (0.076)   0.252 (0.092)   35.5 (7.95)     0.076 (0.049)   0.61 (0.067)    0.825 (0.094)   39.959 (11.754)
    Deployed Models
      Elicit              0.807 (0.04)    0.218 (0.095)   0.206 (0.093)   31.6 (12.5)     0.127 (0.085)   0.596 (0.07)    0.938 (0.096)   130.566 (22.946)
      Statpearls SS       0.77 (0.028)    0.136 (0.037)   0.149 (0.057)   26.5 (9.8)      0.062 (0.026)   0.536 (0.06)    0.939 (0.09)    118.172 (26.603)


Table 3: Clinfo.ai Precision and Recall on PubMedRS-200 


Evaluation Regime         Precision ↑     Recall ↑        Source Included
Restricted Search         0.224 (0.239)   0.057 (0.061)   0.0 (0.0)
Source Dropped            0.186 (0.22)    0.064 (0.064)   0.0 (0.0)
Unrestricted Search       0.162 (0.175)   0.052 (0.064)   0.965 (0.185)


Finally, to ensure that conformity with the SD regime would not prevent direct comparison 
with the other evaluation regimes, we removed from all evaluation regimes those questions for 
which Clinfo.ai could only retrieve the source article (resulting in zero articles remaining after 
exclusion under the SD regime). This yielded 145 SRs (80 after October 2021 and 65 before). 
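
The three evaluation regimes can be summarized as filters over the retrieved articles. The sketch below is illustrative only: the `Article` record, the regime codes as strings, and the choice to apply the Restricted Search cutoff as a post-hoc filter (rather than as a date constraint on the query itself) are assumptions.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Article:
    """Hypothetical minimal record for a retrieved article."""
    pmid: str
    published: date


def apply_regime(
    articles: list[Article],
    regime: str,
    source_pmid: str,
    source_published: date,
) -> list[Article]:
    """Filter retrieved articles according to the evaluation regime.

    RS: keep articles published up to one day before the source SR.
    SD: keep all articles except the source SR itself.
    US: keep everything.
    """
    if regime == "RS":
        cutoff = source_published - timedelta(days=1)
        return [a for a in articles if a.published <= cutoff]
    if regime == "SD":
        return [a for a in articles if a.pmid != source_pmid]
    if regime == "US":
        return list(articles)
    raise ValueError(f"unknown regime: {regime!r}")
```

A question for which the SD filter returns an empty list corresponds to the case removed from all regimes in the paragraph above.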


5. Experimental Results and Analysis 


Is RetA associated with significant improvements in automated metric evaluation? 

As reported in previous studies, both GPT-3.5 and GPT-4 without RetA demonstrated 
strong zero-shot performance using both source-augmented (Table 1) and source-free 
(Table 2) metrics. Notably, there was no substantial performance drop observed when these 
models were presented with questions based on source SRs published after September 2021 
(comparing Table 1 and Table S1 in the Supplement). While more studies are necessary, 
we postulate that this can be attributed to the models’ exposure to prior published works 
during training. Since SRs are built upon existing literature ranging across multiple years, 
it is plausible that the models have been trained on relevant information that aids them in 
providing accurate responses to questions based on newer research. However, comparing all 
LLM models against LLM + RetA models, the inclusion of RetA leads to a slight improvement in the 
overall performance of the models when evaluated with SF and SA automated metrics, irrespective 
of the publication date of the source SR. Previous works based on human evaluation 
have observed a similar trend, corroborating our automated evaluation framework. 

How does Clinfo.ai perform compared to other systems? 

As depicted in Table 1, Clinfo.ai exhibited better performance in overall UniEval compared 
to other RetA systems, irrespective of the chosen output strategy (Synthesis, TL;DR, or a 
concatenation of the two). This improvement in performance remained consistent regardless 
of the average length of the output, with Clinfo.ai achieving better results for both approximately 
3× shorter (TL;DR) and around 2× longer outputs (Synthesis). Furthermore, this 
performance persisted across all different evaluation regimes, even when the source SR was 
dropped. This improvement amounted to at least 6.2% and at most 14.9% in UniEval Overall 
performance. These results suggest two significant points: (1) our system is not merely copying 
and pasting information from an SR. Instead, it demonstrates a genuine ability to 
process and present the information effectively, resulting in enhanced performance compared 
to other available tools; and (2) even in the absence of a source SR, Clinfo.ai can still provide 
conclusions that are better aligned with a source SR’s conclusion (compared to tools that 
might include the source SR). 

TL;DR or Synthesis? 

Clinfo.ai TL;DR demonstrates significantly better performance compared to Synthesis and 
Synthesis & TL;DR, even though they all utilize the same relevant retrieved articles. It is worth 
noting that while Synthesis provides evidence to answer the question based on the retrieved 
articles, this evidence may not align with the original evidence reported by a Systematic Review 
(SR). However, the increased performance of TL;DR could be attributed to the LLM’s 
capability to correctly identify the most salient points of the relevant articles and effectively 
summarize them. On the other hand, using only source-free (SF) metrics (Table 2), Elicit performs 
better under BERTScore, ROUGE-L and GoogleBLEU, while Clinfo.ai TL;DR performs 
better under METEOR, chrF, CTC (SF), and CharacTer. 

These results highlight a potential limitation of automated evaluation. For instance, SF 
metrics tend to reward short responses, which may not necessarily be accurate or comprehensive. 
On the other hand, several SA metrics can assign the best score to considerably longer 
generations (UniEval’s Coherence and Relevance, and COMET), acknowledging their quality 
and relevance. This discrepancy in evaluation metrics raises concerns about the fair assessment 
of model performance and emphasizes the need for a comprehensive evaluation approach. 

Comparing different evaluation regimes, the best performance was observed under the Unrestricted 
Search evaluation regime, possibly due to the fact that the source SR was retrieved 
on 96.5% of the questions. As expected given the restricted set of retrievable documents, 
Clinfo.ai’s precision was highest under the Restricted Search regime (Table 3). 


6. Conclusion 


The rapidly expanding medical literature and the capabilities of LLMs to process and sum- 
marize vast amounts of information have led to the development of several tools that utilize 
LLMs to generate on-demand summaries of published scientific literature. However, the lack of 
high-quality datasets and appropriate benchmarking tasks has hindered rigorous evaluations 
of these tools. To address this gap, we have introduced Clinfo.ai, an open-source end-to-end 
LLM-chain workflow designed to query, evaluate, and synthesize medical literature into concise 
summaries for answering questions on demand. Additionally, we introduce a unique dataset, 
PubMedRS-200, which consists of questions and answers extracted from systematic reviews, 
enabling automatic evaluation of LLM performance in Retrieval Augmentation Question An- 
swering. Our tools and benchmarking dataset are publicly available to ensure reproducibility 
and to facilitate further research in harnessing LLMs for Retrieval Augmentation Question 
Answering tasks. 


7. Limitations 


In this study, we employed automated metrics that have demonstrated moderate-to-high corre- 
lation with human preferences, but we did not explicitly solicit human preferences to evaluate 
the RetA LLM systems considered. Future work should consider including human evalua- 
tion to ensure alignment of automated metrics and human preferences. Lastly, it is worth 
noting that prior studies have reported that LLMs demonstrate the ability to generate accu- 
rate Boolean operators and syntax, effectively adhering to PubMed query formats. However, 
our observations revealed that these models also generated hallucinated MeSH terms, which 
could potentially lead to the exclusion of relevant studies. To overcome this limitation, future 
research efforts should prioritize improving the query generation process, ensuring that gener- 
ated MeSH terms are reliable and relevant for better precision and recall in medical literature 
search tasks. 


A Conversational Agent for Early Detection of Neurotoxic Effects of 
Medications through Automated Intensive Observation 

We present a fully automated AI-based system for intensive monitoring of cognitive symptoms 
of neurotoxicity that frequently appear as a result of immunotherapy of hematologic 
malignancies. Early manifestations of these symptoms are evident in the patient’s speech 
in the form of mild aphasia and confusion and can be detected and effectively treated prior 
to onset of more serious and potentially life-threatening impairment. We have developed 
the Automated Neural Nursing Assistant (ANNA) system designed to conduct a brief cognitive 
assessment several times per day over the telephone for 5-14 days following infusion 
of the immunotherapy medication. ANNA uses a conversational agent based on a large 
language model to elicit spontaneous speech in a semi-structured dialogue, followed by a 
series of brief language-based neurocognitive tests. In this paper we share ANNA’s design 
and implementation, report results of a pilot functional evaluation study, and discuss technical 
and logistic challenges facing the introduction of this type of technology in clinical practice. 
A large-scale clinical evaluation of ANNA will be conducted in an observational study of 
patients undergoing immunotherapy at the University of Minnesota Masonic Cancer Center 
starting in Fall 2023. 


Keywords: Large language models, artificial intelligence, speech, language, immunotherapy, 
Immune effector cell-associated neurotoxicity syndrome 


1. Introduction 


Immune effector cell-associated neurotoxicity syndrome (ICANS) represents a unique complication 
of immune effector therapy, particularly in patients treated with chimeric antigen receptor 
T-cell (CAR-T) therapy for hematologic malignancies. ICANS incidence varies from 
40% to 60% depending on the specific CAR-T product and the grading used; severity is graded on 
a 4-point scale, with 1 being the mildest manifestation and 4 the most severe. ICANS usually 
presents 3-5 days after CAR-T infusion and about 20% of events present at grade 3 or higher. 
The clinical centers administering approved CAR-T therapies have to comply with the Risk 
Evaluation and Mitigation Strategies (REMS) mandated by the Food and Drug Administration (FDA). 
These include monitoring and prompt treatment of ICANS symptoms. The purpose of ICANS monitoring 
and detection after the CAR-T infusion and prompt treatment is to halt ICANS progression 
and minimize the risk of brain edema/herniation, the most feared sequelae of ICANS, resulting 
in severe cognitive impairment, coma, ICU stay, intubation, and, in rare cases, death. 


© 2023 The Authors. Open Access chapter published by World Scientific Publishing Company and 
distributed under the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC) 
4.0 License. 
Clinical manifestations of ICANS typically begin with word-finding difficulty, headaches, 
confusion, dysphasia, aphasia, impaired fine motor skills resulting in agraphia, and somnolence 
and, if untreated, can progress to the more severe sequelae. Expressive aphasia has 
been found to be the most specific symptom of ICANS. It starts as impaired ability to name 
objects, paraphasia errors, hesitant speech, and verbal perseveration, which can then proceed 
to global aphasia (inability to speak or respond to commands) with increasing ICANS severity. 
In fact, initial expressive aphasia is highly prevalent (86%) in patients who then go on 
to develop severe neurotoxicity. Low-grade ICANS is managed predominantly by supportive 
care or low-dose dexamethasone, whereas severe ICANS is usually treated with high doses of 
corticosteroids and anakinra, which can partially block the cascade of inflammation leading to 
pathology. Recently emerging clinical evidence suggests that early intervention with a short 
course of corticosteroids such as dexamethasone in patients with low-grade ICANS can resolve 
these symptoms completely and thereby prevent progression to more severe ICANS. However, 
administration of corticosteroids as prophylaxis of ICANS in all patients undergoing CAR-T 
therapy is not desirable, as corticosteroids may reduce the effectiveness of CAR-T therapy 
itself, have short- and long-term side effects, and increase the risk of infections; a lower 
dose and a short course are therefore preferable. 

The existing methods for detecting neurotoxicity of immunotherapy (as described in the 
National Comprehensive Cancer Network (NCCN) guidelines) consist of administering brief 
cognitive assessment tools such as the Immune Effector Cell-Associated Encephalopathy (ICE) 
Assessment Tool or the CAR-T Cell Toxicity Tool (CARTOX-10). Both are loosely based 
on the Mini-Mental State Examination (MMSE) originally developed for the diagnosis of 
dementia and include several brief cognitive instruments. The CARTOX-10 consists of the 
following 4 categories: Orientation: orientation to year, month, city, hospital, president of 
country of residence (5 pts); Naming: ability to name 3 objects (e.g., point to clock, pen, 
button) (3 pts); Writing: ability to write a standard sentence (e.g., “Our national bird is the 
bald eagle”) (1 pt); Attention: ability to count backwards from 100 by 10 (1 pt). ICE adds one 
more category to the CARTOX-10 instrument: Following commands: ability to follow simple 
commands (e.g., “Close your eyes and stick out your tongue”) (1 pt). 
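
The CARTOX-10 point allocation listed above sums to 10 (5 + 3 + 1 + 1). The sketch below encodes only that allocation; the function name, the category keys, and the input format are hypothetical, and the sketch does not cover the ICE tool.

```python
# Maximum points per CARTOX-10 category, as listed in the text.
CARTOX10_MAX_POINTS = {
    "orientation": 5,  # year, month, city, hospital, president
    "naming": 3,       # name 3 objects
    "writing": 1,      # write a standard sentence
    "attention": 1,    # count backwards from 100 by 10
}


def cartox10_score(points: dict[str, int]) -> int:
    """Sum the points earned per category, validating each against its maximum.

    Categories missing from `points` are scored as 0 (an assumed convention).
    """
    total = 0
    for category, max_points in CARTOX10_MAX_POINTS.items():
        earned = points.get(category, 0)
        if not 0 <= earned <= max_points:
            raise ValueError(
                f"{category}: expected 0..{max_points} points, got {earned}"
            )
        total += earned
    return total
```

Validating each category separately catches data-entry errors that a plain sum would hide, such as recording 6 orientation points.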

These tools are widely used to screen for ICANS, are brief and easy to use at the bedside,
and are highly specific for ICANS, but they lack scientifically rigorous evaluation. They
inherited low sensitivity from the MMSE on which they were based: evidence from practice
suggests that patients in the early stages of ICANS may pass the ICE assessment (especially
if they are able to memorize it due to its frequent administration) while displaying some of
the more subtle ICANS symptoms.⁹ Another major drawback of the existing screening tools is
that while these paper-and-pencil tests are not particularly difficult to administer and score,
their administration requires a qualified healthcare provider and is time-consuming. Because
follow-up after CAR-T therapy requires intensive daily monitoring, usually for up to 14 days,
this introduces a significant burden on clinical personnel and healthcare resources. This routine
practice limits the frequency and depth to which patients can be feasibly monitored with ICE 
and, consequently, may lead to missing the onset of early symptoms in between assessments. 
Using technology to help in administering and facilitating more frequent follow-up of patients 
would be a significant advance in assuring safer CAR-T therapies by enabling earlier detection
of more subtle symptoms. There is also an increasing trend to administer CAR-T therapy in
the outpatient setting. At-home monitoring of ICANS symptoms is highly desirable, as it
offers potentially more timely intervention while providing patients with more comfort and
convenience.

Early detection of ICANS would allow using lower doses of corticosteroids but would also 
require intensive monitoring of cognitive function (e.g., 3 times per day vs. the typical once 
per day frequency). Infrequent monitoring for ICANS (once a day or less) is likely to miss 
the early onset of subtle symptoms, as demonstrated by a study of 133 patients undergoing
CAR-T therapy.¹⁰ Fifty-one of these patients developed ICANS, and 27 of the 51
(53%) already presented with Grade >= 2 ICANS at initial diagnosis. According to the
ASTCT Consensus Grading guidelines, Grade 2 ICANS is diagnosed when the patient scores 
in the range 3-6 (out of 10 possible points). In practical terms, to score 3-6 on the ICE test, 
the patient would have to be significantly impaired (i.e., unable to tell what year, month it is, 
which city or hospital they are in, who the president is, and/or name three basic objects). The 
fact that over half of the patients with ICANS are initially diagnosed with Grade 2 or higher, 
combined with the fact that ICANS can develop in a matter of hours, indicates high likelihood 
that milder symptoms were present earlier but were missed either due to poor sensitivity of 
ICE, its relatively infrequent administration, or both. 
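
To make the grading arithmetic concrete, the sketch below maps an ICE score to the ICE-based component of the ICANS grade and recomputes the proportion quoted above. Only the Grade 2 range (3-6) comes from the text; the remaining cut-offs (7-9 for Grade 1, 0-2 for Grade 3) are stated here as we recall them from the ASTCT consensus and should be checked against the guideline before any use. Full ICANS grading also considers other domains (e.g., level of consciousness, seizures), which this hypothetical helper ignores.

```python
def ice_component_grade(ice_score: int) -> int:
    """Map an ICE score (0-10) to the ICE-based grade component.

    Only the 3-6 -> Grade 2 range is taken from the text; the other
    thresholds are assumptions to be verified against the ASTCT guideline.
    """
    if not 0 <= ice_score <= 10:
        raise ValueError("ICE score must be between 0 and 10")
    if ice_score == 10:
        return 0  # no impairment detected by ICE
    if ice_score >= 7:
        return 1
    if ice_score >= 3:
        return 2
    return 3


# Proportion of ICANS patients first diagnosed at Grade >= 2 in the cited study.
share_grade2_or_higher = 27 / 51
percent_grade2_or_higher = round(100 * share_grade2_or_higher)
```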

Limitations of the standard-of-care approaches to ICANS detection, combined with the
availability of highly effective therapy to prevent its further progression,⁷ create an urgent
need for a validated solution for early identification of neurotoxicity that places a low
burden on providers and is well tolerated by patients. Deploying such a solution could
prevent severe and potentially debilitating symptoms in an estimated 40-70% of cancer
patients who are at risk of ICANS. An effective solution would also reduce the total dose
and duration of steroids, mitigate the steroid effect on CAR-T function and response,
potentially improve CAR-T outcomes, ease access to CAR-T for older people, and facilitate
outpatient administration and management after CAR-T therapy.

In this paper, we describe the design and implementation of the Automated
Neural Nursing Assistant (ANNA), a system designed to address the limitations of the
standard-of-care approaches by automating the administration and analysis of speech-based
neurocognitive tests.ᵃ We also discuss the challenges specific to this clinical use case
of intensive monitoring for cognitive changes associated with neurotoxic effects of
immunotherapy, as well as other emerging areas where such intensive monitoring may be needed.
Finally, we report the results of a small preliminary functional evaluation study designed to
evaluate user experience with the system and collect feedback to determine areas for
improvement prior to conducting a clinical study scheduled to begin in the Fall of 2023.


ᵃA live demo version of ANNA was presented at the 2023 Interspeech symposium and is currently
available at +1 (612)-682-6292. Note: the phone number may change over time; to obtain the current
number for the demo, please contact the authors.
2. System Description


ANNA consists of a multi-platform app (iOS, Android, telephony) that administers neurocog- 
nitive tests, collects voice responses, and securely uploads them to a web service that stores 
the audio and automatically scores the tests. The implementation described in this paper op- 
erates via the telephone interface. To make the conversation as natural as possible, the system 
is implemented to work in full-duplex audio mode in which both the patient and the system 
can speak at the same time without the need for the patient to signal the end of utterances 
by pressing a button. 
Fig. 1: Illustration of the ANNA system architecture and data flow.


ANNA’s architecture, illustrated in Figure 1, consists of two independent components: a
Dialogue Manager and a cognitive assessment test battery. As the Dialogue Manager goes
through the script of the phone call, it transcribes the patient’s speech, responds with
synthesized speech, plays audio, and listens for pauses and cue words in the patient’s speech
to allow ANNA to take turns in conversation in a natural manner. For speech transcription
it employs OpenAI’s state-of-the-art Whisper transcriber, which we have found to produce
acceptable transcriptions of audio recordings from even the lowest-end consumer phones. For
speech generation we use the pre-trained SpeechT5 model.¹¹ Twilioᵇ currently provides
telephony services to ANNA; however, the Dialogue Manager can easily be reconfigured to
accept input from and produce output for other audio recording and playback devices, allowing
us to reuse it in our group’s other voice application projects. The Dialogue Manager
currently consists of a set of rules that are designed to walk the patient through the process of


ᵇhttps://www.twilio.com/
participating in the cognitive assessment. The Dialogue Manager can either read directly
from a script, which is how it conducts the word-list cognitive assessments, or it can prompt a
Large Language Model (LLM) with the patient’s last utterance to generate a response. For
this purpose we currently use the blenderbot-400M-distill¹² model and insert its responses to
the input received from the patient at each conversation turn. Blenderbot is pre-configured
to understand dialog and uses no prompt, responding directly to the patient’s utterance. We
also continue to experiment with other pre-trained LLMs (limited to those that can be used
in a local HIPAA-compliant environment), including the Vicuna Chat¹³ and, most recently,
Llama2 Chat¹⁴ models.
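
The rule-based control flow described above, in which each turn is either read from a script or generated from the patient's last utterance, can be sketched as follows. This is an illustrative skeleton under stated assumptions, not ANNA's implementation: `ScriptStep`, `run_script`, and the `listen`, `speak`, and `generate_reply` callables are hypothetical stand-ins for the speech recognition, speech synthesis, and LLM components.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class ScriptStep:
    prompt: Optional[str] = None  # fixed text read from the script
    use_llm: bool = False         # generate the turn from the caller's last utterance


def run_script(
    steps: List[ScriptStep],
    listen: Callable[[], str],
    speak: Callable[[str], None],
    generate_reply: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Walk through the script once and return (speaker, text) pairs."""
    transcript = []
    last_utterance = ""
    for step in steps:
        # Scripted steps use fixed text; LLM steps respond to the last utterance.
        text = generate_reply(last_utterance) if step.use_llm else (step.prompt or "")
        speak(text)
        transcript.append(("ANNA", text))
        last_utterance = listen()
        transcript.append(("CALLER", last_utterance))
    return transcript


# Stand-ins for the real components, used only to exercise the control flow.
canned_answers = iter(["Yes.", "I feel great.", "Sunny."])
spoken: List[str] = []
demo = run_script(
    steps=[
        ScriptStep(prompt="Are you ready?"),
        ScriptStep(prompt="How are you feeling?"),
        ScriptStep(use_llm=True),
    ],
    listen=lambda: next(canned_answers),
    speak=spoken.append,
    generate_reply=lambda utterance: f"Thank you for sharing. You said: {utterance}",
)
```

Keeping the audio input and output behind plain callables mirrors the point made above that the telephony layer can be swapped for other recording and playback devices.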

We developed ANNA as a set of Docker images that can be easily deployed
on an on-site server when provided with a phone number, a web address, and a GPU with
at least 24 GB of VRAM (e.g., NVIDIA RTX 3090 Ti). The current demo implementation is
running on a server with two NVIDIA RTX 4090 cards. We have also constructed an alternative 
implementation of ANNA which does not use Docker images for any components which require 
access to a GPU, as the fully containerized application can have difficulty accessing the GPU 
in some environments. 


2.1. Spontaneous Speech and Language Elicitation 


We programmed ANNA to make a phone call to the patient’s phone (smartphone or landline)
and begin with a brief conversation with an LLM-based conversational agent that asks the
patient to describe how they are feeling and then converses briefly
on one of a set of pre-defined topics such as favorite pastime, books, or movies. Topics are
currently drawn at random from a pre-defined list without replacement to alleviate practice
effects.
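
Drawing topics without replacement can be sketched as below. The topic strings are examples taken from the text and from Figure 2; the `TopicPicker` class and the choice to reshuffle once every topic has been used are our assumptions, since the paper does not say what happens when the list is exhausted.

```python
import random

TOPICS = ["favorite pastime", "books", "movies", "most memorable trip"]


class TopicPicker:
    """Draw topics in random order without repeats (hypothetical helper)."""

    def __init__(self, topics, seed=None):
        self._rng = random.Random(seed)
        self._all = list(topics)
        self._remaining = []

    def next_topic(self) -> str:
        if not self._remaining:
            # Assumption: start a fresh random order once all topics are used.
            self._remaining = self._all[:]
            self._rng.shuffle(self._remaining)
        return self._remaining.pop()


picker = TopicPicker(TOPICS, seed=0)
first_round = [picker.next_topic() for _ in TOPICS]
```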


2.2. Cognitive Testing: Word List Recall 


The conversation is followed by a series of brief cognitive tasks including a word list learning 
task in which the patient is presented with a list of 6 words and is asked to recall as many 
of these words as the patient can immediately after the presentation (immediate recall) and 
a few minutes later (delayed recall). The word list recall task is vulnerable to practice effects 
in serial testing.¹⁵ Practice effects can mask subtle cognitive changes due to early stages
of ICANS; therefore, we developed a mechanism for generating multiple alternative lists of 
6 words to minimize the effects of repeated test administration. To ensure that the lists 
of words are roughly equivalent across multiple presentations, we developed an approach 
for automatically generating lists of words that are equivalent in their lexical properties of 
frequency, concreteness, and imageability using the MRC Psycholinguistic database.¹⁶
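
One simple way to build roughly equivalent lists is to rank candidate words by a composite of the three lexical properties and deal neighbouring words into different lists. The sketch below illustrates that idea only; it is not the authors' algorithm. The norm values in `TOY_NORMS` are invented for illustration and are not taken from the MRC Psycholinguistic database, and `make_matched_lists` is a hypothetical name.

```python
import random
from statistics import mean, pstdev

# Invented values: word -> (frequency, concreteness, imageability).
TOY_NORMS = {
    "turtle": (8, 610, 600), "spray": (10, 520, 540), "lion": (17, 620, 630),
    "yard": (35, 560, 550), "queen": (41, 580, 610), "land": (217, 570, 560),
    "tiger": (7, 615, 625), "field": (274, 550, 565), "plane": (114, 590, 600),
    "farm": (125, 575, 580), "king": (88, 585, 605), "rain": (70, 600, 615),
}


def make_matched_lists(norms, n_lists, list_len, seed=None):
    """Deal words with similar composite scores into different lists."""
    needed = n_lists * list_len
    if len(norms) < needed:
        raise ValueError("not enough candidate words")
    rng = random.Random(seed)
    columns = list(zip(*norms.values()))
    means = [mean(col) for col in columns]
    sds = [pstdev(col) or 1.0 for col in columns]

    def composite(word):
        # Sum of z-scores, so each property contributes on the same scale.
        return sum((v - m) / s for v, m, s in zip(norms[word], means, sds))

    ranked = sorted(norms, key=composite)[:needed]
    lists = [[] for _ in range(n_lists)]
    for i in range(0, needed, n_lists):
        block = ranked[i:i + n_lists]  # words with neighbouring scores
        rng.shuffle(block)
        for word_list, word in zip(lists, block):
            word_list.append(word)
    return lists


word_lists = make_matched_lists(TOY_NORMS, n_lists=2, list_len=6, seed=1)
```

Summing z-scores is a deliberate simplification; matching each property separately would give tighter equivalence at the cost of a more involved search.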


2.3. Cognitive Testing: Verbal Fluency 


Two verbal fluency tests are administered between the immediate and delayed recall tasks.
They consist of a category fluency test, in which the patient is asked to name
as many animals as they can think of in 30 seconds, followed by a letter fluency test, in which
the patient is asked to name as many words beginning with the letter “F” as they can think of,
also in 30 seconds. The
verbal fluency task also suffers from practice effects; however, prior work of other researchers 
and our own preliminary data show that in these generative tasks the practice effects are 
small and plateau after several presentations in individuals with cognitive impairment.'® The 
rationale for selecting verbal fluency and list learning tasks rests on the evidence that they
are particularly sensitive to a broad spectrum of cognitive impairment effects caused by a
wide variety of acute and chronic conditions, including effects of medications,¹⁷ are quick to
administer, and lend themselves well to automation.
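
Automated scoring of the two fluency tests from a transcript can be sketched as follows. This is a minimal illustration, not ANNA's scoring code: the stemmer is deliberately crude, the animal lexicon would have to be supplied by the caller, and the function names are hypothetical.

```python
import re


def tokenize(transcript: str):
    """Lower-case the transcript and split it into word tokens."""
    return re.findall(r"[a-z']+", transcript.lower())


def crude_stem(word: str) -> str:
    """Strip a common ending so 'smile', 'smiles', 'smiling' collapse together."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    return word.rstrip("e")


def letter_fluency_score(transcript: str, letter: str) -> int:
    """Count distinct words starting with `letter`, ignoring same-stem variants."""
    seen = set()
    for word in tokenize(transcript):
        if word.startswith(letter.lower()):
            seen.add(crude_stem(word))
    return len(seen)


def category_fluency_score(transcript: str, lexicon: set) -> int:
    """Count distinct words that belong to the category lexicon."""
    return len({word for word in tokenize(transcript) if word in lexicon})


# The letter "F" response from the example interaction in Figure 2.
f_score = letter_fluency_score("Field, final, fantasy, fortune, fox, frisbee,", "F")
```

A production scorer would need a proper lemmatizer and handling of speech recognition errors, both of which are outside this sketch.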

We selected the abbreviated versions of the list learning and verbal fluency tests to make 
them less burdensome for patients undergoing cancer treatment. The abbreviated versions have
been shown to have similar psychometric properties to their full counterparts (10 words for 
the list learning and 60 seconds for the verbal fluency tests).!°19 


2.4. Speech and Language Analysis 


The speech collected with ANNA is first subjected to automatic speech recognition to produce 
a verbatim transcript of everything the patient said during the interaction with the system. 
The current implementation of ANNA relies on the pre-trained Whisper neural transformer
model (large-v1).²⁰ The transcribed speech is analyzed to extract the following language
characteristics: syntactic complexity and language model perplexity. Syntactic complexity is
measured using technology we previously developed to characterize language changes in patients
with dementia.²¹ Measures of syntactic complexity include the mean number of clauses,
various measures of the depth of syntactic trees obtained from a syntactic constituency parser,
and the mean syntactic dependency length obtained from a dependency parser. Perplexity
is a measure of how many different equally most probable words can follow any given word 
based on probabilities obtained from a probabilistic or a neural language model. High mean 
perplexity computed over an utterance that did not participate in training the model indicates 
a poor fit between a language model and the text of the utterance. This measure has been 
shown to be useful for distinguishing between speech of individuals with probable Alzheimer’s
Disease and healthy controls.²² Both the syntactic complexity and the language model
perplexity have been included in an attempt to capture early signs of confusion and changes in
language patterns that have been noted in patients starting to develop ICANS. In addition
to the language characteristics described above, we also extract the following paralinguistic 
speech characteristics: initial pause duration prior to onset of speech, mean pause duration, 
perseveration and hesitation density. These characteristics have also been noted (anecdotally) 
by oncology treatment teams as being observed in early stages of ICANS. 
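
Two of the measures above lend themselves to a short worked example. Perplexity over an utterance of N words is exp(-(1/N) Σ log p(wᵢ | context)); the sketch below computes it with a toy add-one-smoothed bigram model, which stands in for the probabilistic or neural language model mentioned in the text and is not the model used in ANNA. The pause features assume per-word start and end times (in seconds, measured from the end of the prompt) and a 0.25 s minimum pause; both the input format and the threshold are our assumptions.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Collect bigram counts for a toy add-one-smoothed language model."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split()
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, len(vocab) + 1  # +1 for unseen words


def perplexity(utterance, model):
    """exp of the negative mean log-probability of the words in the utterance."""
    context_counts, bigram_counts, vocab_size = model
    tokens = ["<s>"] + utterance.lower().split()
    if len(tokens) < 2:
        raise ValueError("utterance must contain at least one word")
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(tokens) - 1))


def pause_features(word_times, min_pause=0.25):
    """Initial pause and mean inter-word pause from (start, end) times."""
    if not word_times:
        raise ValueError("no words in response")
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(word_times, word_times[1:])]
    pauses = [gap for gap in gaps if gap >= min_pause]
    return {
        "initial_pause": word_times[0][0],
        "mean_pause": sum(pauses) / len(pauses) if pauses else 0.0,
    }


model = train_bigram(["i feel great today", "i feel fine", "the weather is great"])
familiar = perplexity("i feel great", model)
scrambled = perplexity("great feel i", model)
pauses = pause_features([(1.0, 1.5), (2.0, 2.5), (2.6, 3.0)])
```

The scrambled word order receives a higher perplexity than the familiar one, which is the kind of poor model fit the measure is meant to surface.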


2.5. Motivation for Selecting Analytical Measures 


Rigorous prior work by other investigators and by our group demonstrated that the category 
verbal fluency task (animal naming) discriminates between individuals with Alzheimer’s dis- 
ease dementia and healthy controls with sensitivity of 0.88 and specificity of 0.96.7? The list 
learning task has also been previously shown to have excellent psychometric properties for 
detecting mild (mean MMSE score = 22.1) memory impairment (sensitivity and specificity of 
90%) and high test-retest reliability.24 In our own prior work, we have demonstrated that
deep neural modeling of the spontaneous speech patterns produced during a picture description
task can discriminate between Alzheimer’s disease dementia and controls with 87%
accuracy.?* Our team has been engaged in validation studies in which we have demonstrated 
that neurocognitive tests of verbal fluency (semantic and phonemic) are highly sensitive to 
the neurotoxic effects of psychoactive medications such as lorazepam and topiramate,” as 
well as the effects of chronic traumatic encephalopathy,” effects of nicotine withdrawal?’ and 
neurodegeneration due to Alzheimer’s disease.?® While ICANS is distinctly different from the 
slowly progressing symptoms in dementia and chronic traumatic encephalopathy, it is similar 
to the rapid (on the order of hours) changes in cognition observed as a result of acute effects 
of psychoactive medications and nicotine withdrawal. 


2.6. Pilot Evaluation Study 


We conducted a small functional evaluation study of ANNA to elicit initial feedback from 
healthy individuals that could inform any further changes in system design and help us debug 
the system. We asked 10 Amazon Mechanical Turk workers to place an anonymous call to 
ANNA, interact with the system, and respond to a brief survey shown in Figure 3. 


- ANNA: Hello, this call will be recorded for research use. Because I am a robot, I can be slow sometimes. Please don't hang up if I take too long to respond or say things that don't make much sense. I am going to ask you a few questions. Please listen carefully and answer them in as much detail as you can. Are you ready?

- CALLER: Yes.

- ANNA: Okay. First, I would like to ask you how you are feeling. For example, have there been any changes in how you feel in the last few hours?

- CALLER: No, I think I feel great. I have been enjoying the weather and I think overall, day's been going great and no changes in how I am feeling.

- ANNA: That's great! I’m glad you're having a good day. What's the weather like where you are?

- CALLER: Right now it is 97 degrees Fahrenheit. It's sunny. We're supposed to get triple digit 100 degree weather probably for the next seven days. But I mean, besides that, I'm in a cool environment and trying to limit my time outside.

- ANNA: Thank you for sharing. Now, I would like to ask you to tell me as much as you can about your most memorable trip. Feel free to take your time in answering.

- CALLER: Okay, well I would say most memorable trip is, this is a long time ago, this is back in 1999, in December of 1999, took a trip to Miami, Florida.

- ANNA: That sounds like a lot of fun. I've never been to Florida. What did you do there?

- CALLER: Well, went to Disney World, went to the beach, went to Miami, went to Orlando.

- ANNA: Thank you. Now, I am going to ask you to remember a list of six words. You will hear six words being spoken to you. Please repeat each word aloud as it is being spoken to you. Later, I will ask you to recall all six words. Are you ready?

- CALLER: Yeah, that's what we - it's pretty much what we did. Yes.

- ANNA: Here is the list: turtle, spray, lion, yard, queen, land. Now repeat as many of these words as you remember and say next when you are done

- CALLER: Turtle. Prey. Lion. Yard. Plane. Land. Turtle. Prey. Land. Farm. Trying to think of the other ones.

- ANNA: Thank you. Now, I will give you a letter of the alphabet. I am going to ask you to name words that begin with that letter, as fast as you can. For example, if I give you the letter S, as in sam, you can say soft, smile, and so on. Do not use the same word with a different ending such as smiling, or smiles. Are you ready?

- CALLER: Yes.

- ANNA: Okay. Your letter is the letter F, as in foxtrot. Please name all the words that you can think of that begin with the letter F. You have thirty seconds. Please begin.

- CALLER: Field, final, fantasy, fortune, fox, frisbee,

- ANNA: Please stop. Now, I will give you a category. I am going to ask you to name as fast as you can all the things that belong to that category. For example, if I give you the category of articles of clothing, you can say shirt, or jacket, or pants, and so on. Are you ready?

- CALLER: Yes.

- ANNA: Okay. Your category is animals. Begin naming as many animals as you can think of. You have thirty seconds.

- CALLER: Go ahead.

- ANNA: Please stop. We are almost done. Just one last thing. A few minutes ago I read a list of six words to you. Please try to recall as many of these words as you can and say them aloud as you remember them. You have thirty seconds. Please begin.

- CALLER: Tiger. Field prey, turtle.

- ANNA: Thank you. This concludes our session. Until next time. Goodbye.


Fig. 2: Example of an interaction between an Amazon Mechanical Turk worker and ANNA. 
This example shows the actual unedited transcription of the caller’s voice with the Whisper 
transcriber. 
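
The two recall turns in this transcript illustrate how a word list response could be scored automatically. The sketch below counts exact matches against the presented list and tracks repetitions; `score_recall` is a hypothetical helper, and treating near-misses such as "prey" for "spray" as incorrect is our simplifying assumption, since the paper does not describe how recognition errors are handled.

```python
import re

TARGET_WORDS = ["turtle", "spray", "lion", "yard", "queen", "land"]


def score_recall(targets, transcript):
    """Count distinct target words in the response, plus repeated targets."""
    target_set = {word.lower() for word in targets}
    recalled = []
    repetitions = 0
    for token in re.findall(r"[a-z']+", transcript.lower()):
        if token in target_set:
            if token in recalled:
                repetitions += 1
            else:
                recalled.append(token)
    return {"correct": len(recalled), "recalled": recalled, "repetitions": repetitions}


immediate = score_recall(
    TARGET_WORDS,
    "Turtle. Prey. Lion. Yard. Plane. Land. Turtle. Prey. Land. Farm. "
    "Trying to think of the other ones.",
)
delayed = score_recall(TARGET_WORDS, "Tiger. Field prey, turtle.")
```

Under exact matching, the caller recalls four of the six words immediately and one after the delay.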
3. Results


An example interaction between a functional evaluation study participant and ANNA is
shown in Figure 2. This example shows a verbatim transcript of the interaction, which
illustrates the performance of all ANNA components, including the automatic speech recognition
and large language models. The quantitative results of the functional evaluation are
summarized in Table 1.

The Duration column in Table 1 reflects the amount of time it took the evaluator to interact
with ANNA and complete the evaluation survey. The mean duration for the 10 evaluators was
approximately 11.7 minutes (700 seconds). All evaluators had a 100% approval rating on the
Amazon Mechanical Turk system
(i.e., they were approved for payment for all human intelligence tasks that they performed in
the past). All evaluators were able to get to the end of the interaction with ANNA successfully
(Completed column in Table 1). The mean audibility rating was 4.1 (SD: 0.74), the sensibility
of ANNA’s responses to evaluators was rated as 3.7 (SD: 0.95), and the latency of system
responses was rated as 2.0 (SD: 0.92).


Table 1: Results of pilot functional evaluation. 


Evaluator   Duration      Completed   Audibility   Sensibility   Latency      Comments
            (sec.)        (Y/N)       (1-5)        (1-5)         (1-5)

1           660           Y           5            3             1            -
2           669           Y           4            3             4            Good exprience
3           660           Y           5            3             1            -
4           557           Y           3            5             2            Make it a little bit faster in response.
5           784           Y           4            5             2            -
6           808           Y           4            3             3            It could be a bit more human-like, now it sounds too machine like.
7           782           Y           3            5             2            Just improving response time would be a huge upgrade.
8           744           Y           5            3             1            -
9           748           Y           4            3             2            GOOD
10          588           Y           4            4             2            It should be able to restate the instructions instead of just waiting on an affirmative to being.

Mean (SD)   700 (86.18)   Y           4.1 (0.74)   3.7 (0.95)    2.0 (0.92)   -
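
The summary row of Table 1 can be recomputed from the per-evaluator values. The sketch below uses the sample standard deviation, which is our assumption about how the SD was calculated; values recomputed from the transcribed table may differ from the reported ones in the last digit.

```python
from statistics import mean, stdev

# Per-evaluator values transcribed from Table 1 (evaluators 1-10).
ratings = {
    "duration_sec": [660, 669, 660, 557, 784, 808, 782, 744, 748, 588],
    "audibility": [5, 4, 5, 3, 4, 4, 3, 5, 4, 4],
    "sensibility": [3, 3, 3, 5, 5, 3, 5, 3, 3, 4],
    "latency": [1, 4, 1, 2, 2, 3, 2, 1, 2, 2],
}

# (mean, sample SD) per column, rounded to two decimals.
summary = {
    name: (round(mean(values), 2), round(stdev(values), 2))
    for name, values in ratings.items()
}

mean_duration_minutes = round(summary["duration_sec"][0] / 60, 1)
```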
4. Discussion


The broad clinical need that ANNA is designed to address arises from the limitations of the
healthcare system, in which intensive monitoring of patients’ cognitive function (multiple times
per day) by a human healthcare professional is not feasible and is cost-prohibitive. Monitoring
cognitive function is unlike monitoring of physiologic function in that the former requires
symbol-mediated interaction, which is typically achieved through the use of language. Many
years of research in language technology and artificial intelligence have yielded a number of
conversational agents designed for use in healthcare applications.²⁹ However, recent developments in
speech and language technology and, in particular, the introduction of large language models
such as ChatGPT and Whisper can potentially move these efforts to a new level by making
these systems simpler and more accurate in recognizing the incoming speech and producing
more natural and flexible responses.


Were you able to get through the entire script of the phonecall?

On a scale of 1-5, with 1 being slow and 5 being fast, how quick was the chatbot in responding in conversation?

On a scale of 1-5, with 1 being nonsense and 5 being sensible, how sensible were the things the chatbot said back to you?

On a scale of 1-5, with 1 being difficult and 5 being easy, how easy was it to hear the words that were spoken in the word memorization test?

Additional comments on how we can improve our chatbot system and its user experience


Fig. 3: Evaluation survey administered to Amazon Mechanical Turk workers. 


The proposed automated cognitive assessment methods address the limitations of the exist- 
ing manual methods by using a series of brief, validated and easy to administer neurocognitive 
tests that use speech as the input modality to detect expressive aphasia deficits - the most 
specific symptom of ICANS. The key innovative aspect of our approach is the use of AI tech- 
nologies such as LLMs and automatic speech recognition based on deep learning to convert 
spoken responses to text that can subsequently be used to compute traditional scores as well 
as novel speech and language-based measures to further improve the sensitivity and specificity 
of these instruments. The use of AI, large language models and automatic speech recognition 
and synthesis, as well as scoring algorithms tailored to the neurocognitive tests at hand, is 
what sets us apart from other commercial and academic computerized neurocognitive testing 
approaches. To the best of our knowledge none of the current computerized approaches to
neurocognitive assessment use conversational AI technology to elicit speech from patients and 
to analyze the resulting speech for cognitive impairment due to immunotherapy with machine 
learning. Another innovation is that we use extensively validated and recognized neurocogni- 
tive tests in a novel, accessible, and fully automated way that can also enable at-home and/or 
remote monitoring for ICANS, which could improve the accessibility of immunotherapy in 
rural and other settings away from medical centers. 

In addition to the immunotherapy used to treat certain types of cancer, other newly 
emerging therapies that leverage the immune system have been recently approved by the US 
Food and Drug Administration for treatment of Alzheimer’s disease. The most recent approval 
was granted in July 2023 to lecanemab, an immunotherapy agent that was demonstrated to 
remove Alzheimer’s disease biomarkers from the brain and significantly (albeit moderately) 
slow down the disease progression as compared to other treatments. However, serious side 
effects including brain edema in 12.6% of the participants in the active arm of the clinical trial 
of this medication were observed.³⁰ Therefore, similarly to ICANS, early detection and clinical
management of these side-effects in the treatment of Alzheimer’s disease may potentially 
benefit from intensive cognitive monitoring. Another potential clinical application area for 
systems like ANNA is in automating the monitoring for post-operative delirium. Proactive 
monitoring for post-operative delirium and early intervention has been shown to shorten length 
of hospital stay and improve surgical outcomes.³¹

Pending successful demonstration of ANNA’s feasibility and validity for early detection of
ICANS, ANNA can move beyond research into wide adoption in clinical practice: it is ready
to be integrated into a wide variety of clinical settings as a laboratory
service using existing technology and informatics standards, including the Health Level
7 (HL7 v2) and FHIR protocols, to interface with the EHR via the standard lab test results route.
One of the challenging issues that we expect to face is the handling of critical values.
Critical values, or failure to complete the ANNA assessment, will need to be communicated to the
care team verbally by phone, also using a standard protocol (ISO 15189) for communicating
critical lab values. To this end, ANNA would need to have an interface (voice or graphical) to
enable the care team to configure the system for each individual patient. The configuration 
will need to include telephone numbers for the patient and the care team as well as some 
of the patients’ preferences (e.g., topics for the conversational part of ANNA’s assessments, 
do-not-call times, system voice and personality preferences). 
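
As an illustration of the lab-result route, the sketch below packages one ANNA score as a FHIR Observation resource with a critical-value flag. The resource fields and the interpretation codes ("LL" for critical low, "N" for normal) follow the public FHIR specification as we understand it; the code system URL `http://example.org/anna-codes`, the code `anna-delayed-recall`, and the critical threshold are hypothetical placeholders, since the paper does not define them.

```python
import json
from datetime import datetime, timezone


def build_observation(patient_id, score, critical_below, effective=None):
    """Build a FHIR Observation dict for one ANNA score (illustrative only)."""
    effective = effective or datetime.now(timezone.utc)
    flag = "LL" if score < critical_below else "N"  # critical low vs. normal
    return {
        "resourceType": "Observation",
        "status": "final",
        "category": [{"coding": [{
            "system": "http://terminology.hl7.org/CodeSystem/observation-category",
            "code": "laboratory",
        }]}],
        # Hypothetical local code; a real deployment would need an agreed code.
        "code": {"coding": [{
            "system": "http://example.org/anna-codes",
            "code": "anna-delayed-recall",
            "display": "ANNA delayed word list recall (words recalled)",
        }]},
        "subject": {"reference": f"Patient/{patient_id}"},
        "effectiveDateTime": effective.isoformat(),
        "valueInteger": score,
        "interpretation": [{"coding": [{
            "system": "http://terminology.hl7.org/CodeSystem/v3-ObservationInterpretation",
            "code": flag,
        }]}],
    }


observation = build_observation(
    patient_id="example-123",
    score=1,
    critical_below=3,  # hypothetical threshold
    effective=datetime(2023, 9, 1, 14, 30, tzinfo=timezone.utc),
)
payload = json.dumps(observation)
```

A critical flag in the payload would not replace the verbal notification described above; it would only mark which results trigger it.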


4.1. Limitations 


ANNA currently has several technical limitations. Due to the use of multiple large neural
models, the response latency can vary from less than a second for short turns (e.g.,
confirmations) to 3-4 seconds for longer turns in which ANNA has to convert longer input utterances
to text, generate a response text, and synthesize it into a spoken utterance. The
evaluators in the pilot study clearly noted this as something that should be improved. We
found that neural text-to-speech generation is the biggest contributor to response latency;
however, other modules can be optimized as well. We plan to reduce the response latency in
the production version of ANNA by a) switching to a faster version of the Whisper model,ᶜ
which has been benchmarked to be about 5 times faster than the OpenAI version, and b)
distributing the LLM and TTS models across multiple GPU cards.

Another potential limitation that ANNA inherits from the pre-trained large language
models is the potential for going off-topic (a.k.a. “hallucinating”) during the initial conversational
part of the assessment. To minimize this risk, we limit the amount of text produced
by the models in response to the user input to 1-2 utterances. In the near future, we also
plan to implement a set of guardrails to prevent ANNA from responding in an inappropriate
or offensive manner.ᵈ This limitation was not noted in the pilot study, as the sensibility of
ANNA’s responses was rated as fairly high (mean 3.7 out of 5) and none of the 10 evaluators
commented on any specific nonsensical responses.
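
Capping the length of a generated response can be as simple as truncating it after a fixed number of sentences. The sketch below treats each sentence as one utterance, which is our assumption; the paper does not specify how utterances are delimited or where in the pipeline the limit is applied. The splitter is intentionally naive and will mis-handle abbreviations such as "e.g.".

```python
import re


def limit_utterances(text: str, max_utterances: int = 2) -> str:
    """Keep only the first `max_utterances` sentences of a generated reply."""
    # A sentence is a run of text ending in ., ! or ?, or a trailing fragment.
    sentences = re.findall(r"[^.!?]+[.!?]+|[^.!?]+$", text.strip())
    return " ".join(s.strip() for s in sentences[:max_utterances])


# Hypothetical model output, used only to exercise the helper.
raw_reply = "That sounds relaxing. Do you read often? I prefer films. Films are short."
short_reply = limit_utterances(raw_reply)
```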

The racial, cultural, gender, and ethnic biases learned by large language models from training
data are a major concern with applications of AI in medicine in general³² and are a potential
concern in our application as well. Given the nature of the interactions between patients and
ANNA and the focus on eliciting as much speech from patients as possible over as few
conversational turns as possible, we do not anticipate that any such biases will have a chance to manifest
themselves in any discernible fashion to the patients. Nonetheless, since inherent bias in
language models is a known issue, we plan to examine the data collected with ANNA for any
signs of bias or unfairness and experiment with current de-biasing methods.

ANNA’s use case also has a distinct strength with respect to one of the biggest known 
limitations of large language models: the variable trustworthiness of the information they
generate. The lack of confidence in the information provided by these models is currently one
of the major barriers to their adoption for clinical applications as primary sources of clinical
knowledge.³³ ANNA’s clinical use case, however, does not rely on large language models for
knowledge. We rely on these models only to support a chatbot application used to elicit speech 
from patients for subsequent analysis and not to inform either patients or clinicians. As such, 
ANNA currently represents one of the safest and most immediate ways of using large language 
models in a clinical context. 


5. Next Steps 


We have developed and submitted an observational study protocol to the University of
Minnesota Institutional Review Board. In this prospective clinical study, University of Minnesota
Masonic Cancer Center patients undergoing CAR-T therapy for hematological cancers will
be monitored for ICANS with ANNA concurrently with the standard-of-care ICE testing.
The following primary endpoints will be evaluated: a) acceptability of the frequency of ANNA 
administration; b) quality and quantity of audio collected from patients; and c) naturalness 
and ease of interaction with automated ANNA assessments. As we test the central feasibil- 
ity hypothesis, we will also seek to understand the reasons why ANNA administration may 
not have occurred (examples: unable, refused, too tired, ill, forgot, technical reason, app not 


ᶜhttps://github.com/guillaumekln/faster-whisper
ᵈhttps://github.com/NVIDIA/NeMo-Guardrails
working, battery out, other). We will also evaluate ANNA’s usability characteristics that are
not central to its feasibility but may affect the feasibility indirectly, such as naturalness of
interactions with patients, convenience, and patients’ perceptions of ease of use.

Prior to conducting the clinical study, we plan to address the system latency limitation 
pointed out by the pilot study evaluators as well as experiment with the more recently released 
chat models such as Llama2 to improve the sensibility of the initial conversations with the 
patient. 

We also plan to enhance the language analysis of the conversations collected with ANNA
by adding language coherence measures using a recently developed Time-series Augmented
Representations for Detection of Incoherent Speech (TARDIS) method that relies on a
time-series analysis of coherence features computed using semantic relatedness between words in a
given piece of discourse.ᵉ The TARDIS approach has been used successfully to characterize
disordered speech in patients with schizophrenia³⁴ and may prove to be useful for detecting
possible thought disturbances caused by early ICANS.

One of our current concerns with using ANNA for intensive monitoring of cognitive changes 
in cancer patients is that even the abbreviated versions of the cognitive tests we have currently
implemented may present a burden for the patients, who are likely to experience significant
distress and fatigue as a result of therapy. Our ultimate goal in the forthcoming clinical study
is to determine if we can reliably ascertain the onset of ICANS based entirely on the analysis 
of the brief conversation between ANNA and the patient. If we can successfully do so, then 
we would likely be able to dispense with the more formal word list learning and verbal fluency 
tests, which would make intensive monitoring much less burdensome for patients. 
Leveraging 3D Echocardiograms to Evaluate AI Model Performance in Predicting
Cardiac Function on Out-of-Distribution Data


Grant Duffy, Kai Christensen and David Ouyang 
Email: David.Ouyang@cshs.org 


Advancements in medical imaging and artificial intelligence (AI) have revolutionized 
the field of cardiac diagnostics, providing accurate and efficient tools for assessing cardiac 
function. AI diagnostics claim to improve upon human-to-human variation, which is 
known to be significant¹,². In practice, however, AI models for cardiac ultrasound 
are run on images acquired by human sonographers, whose quality and consistency 
may vary. Because ultrasound has more variation than other medical imaging modalities*, variation 
in image acquisition may lead to out-of-distribution (OOD) data and unpredictable performance of 
the AI tools. Recent advances in ultrasound technology have allowed the acquisition of both 
3D and 2D data; however, 3D has more limited temporal and spatial resolution and is 
still not routinely acquired. Because the training datasets used when developing AI 
algorithms consist mostly of 2D images, it is difficult to determine the impact of 
human variation on the performance of AI tools in the real world. The objective of this 
project is to leverage 3D echos to simulate realistic human variation in image acquisition 
and better understand the OOD performance of a previously validated AI model¹. In doing 
so, we develop tools for interpreting 3D echo data and quantifiably recreating common 
variation in image acquisition between sonographers. We also develop a technique for 
finding good standard 2D views in 3D echo volumes. We found the performance of the AI 
model we evaluated to be as expected when the view is good, but variations in acquisition 
position degraded AI model performance. Performance on far-from-ideal views was poor 
but still better than random, suggesting that the model uses some information that 
permeates the whole volume, not just a quality view. Additionally, we found that variations 
in foreshortening did not result in the same errors that a human would make. 


Keywords: 3D Echo; AI; Machine Learning; Echocardiology. 


1. Introduction 


Echocardiography, or cardiac ultrasound, is the most prevalent cardiac imaging modality⁶. Cardiac 
ultrasound provides accurate, noninvasive views of the heart in real time with limited 
equipment and with high temporal resolution⁷. In traditional transthoracic echocardiography (TTE), a 
sonographer acquires 2D images and videos of the heart in standard orientations, or views. Two 
standard views are the apical four chamber (A4C) and apical two chamber (A2C) views, which are 
both taken along the major axis of the heart from its apex. These views are crucial for 
assessing cardiac function and diagnosing heart failure and cardiac hypertrophy¹,⁸⁻¹⁵. In theory, these 
two views are separated only by a probe rotation of roughly 60 degrees; in practice, this depends on 
the sonographer’s judgement of view quality and probe placement. 

Recent advances in ultrasound technology have increased the temporal and spatial resolution of 
images acquired. A wide field of view allows 3D images to be acquired with the same probes and 
hardware, however at lower resolution”. In addition to the standard TTE views, sometimes 
additional 3D images are acquired to better characterize complex cardiac structures and provide 
holistic evaluations of cardiac form and function. Focused images of the heart valves as well as the 
left ventricle can be used to accurately assess metrics that might be challenging to measure in 2D 
images. 

One example of acquisition error in 2D images is foreshortening, where suboptimal 
images of the left ventricle can cause overestimation of cardiac function¹⁶,¹⁷. Apical 
views depend on the probe being placed near the apex of the left ventricle, which should not contract 
inward; off-axis, foreshortened views, however, show an apparent contraction of the apex that exaggerates 
left ventricular function. The result of this error is an underestimate of LV volume at systole and 
ultimately an overestimate of ejection fraction¹⁷. Although foreshortening is known to be a common 
source of measurement error, it is difficult to know how prevalent it is because it is difficult to 
quantify foreshortening in 2D images. There have been attempts to automatically detect 
foreshortening using machine learning or other algorithms!*!*!°. These algorithms need to be run 
in real time on the ultrasound machine or trained on other modalities, limiting their practicality. 

Although adding 3D acquisitions to a study may add value in these cases, it also takes additional 
time and training. The result is that 3D echo images are much less prevalent. At Cedars-Sinai 
Medical Center (CSMC), apical 3D echo images are outnumbered by other video acquisitions 
roughly 11,000 to 1, making 3D echo datasets of reasonable size rare. 

There is a large, and quickly growing, body of research dedicated to AI in medicine and 
specifically cardiology. Several models aim to automate echo measurements or diagnosis! +", 
These models show promise in revolutionizing how echocardiology is performed. Because of the 
large disparity in prevalence of 2D vs 3D echos and the often-proprietary data format of 3D images, 
AI models in this field are almost exclusively trained and evaluated on 2D TTE images. 2D datasets 
curated in this way contain only images acquired by human sonographers in specific views and do 
not span the full distribution of possible echo images. 

It is known that machine learning models can perform unpredictably on out-of-distribution data²¹. 
Training methods that include data augmentations, such as translating, rotating, and resizing images, 
attempt to broaden the coverage of the datasets and mitigate these risks. But these augmentations can only 
simulate transformations of an image constrained to the 2D plane, whereas real ultrasound acquisitions 
can include rotations and translations in 3D. One of the main goals of AI in medicine is the mitigation 
of human error. For models that do not perform well under 3D view transformations, however, performance 
could depend strongly on the sonographer’s acquisition quality. 

In this research, we propose methods for evaluating AI model performance on off-axis views by 
introducing realistic 3D spatial transformations to the acquisition plane in 3D volumes. Although 
3D echos remain relatively rare in the CSMC system, by searching over 16 years of data we curated a 
dataset of 1,528 apical 3D images. Through reverse engineering, we were able to decode the Philips 3D 
DICOM data format in which these images are stored. We developed functions for slicing 3D data into 2D 
images and simulating realistic transformations that could be introduced by sonographer motion. 
We use a deep learning view classifier, trained specifically for this task, to find the ideal view 
in each volume so that performance can be compared against distance from that view. 

To test these methods, we chose to evaluate the EchoNet-Dynamic model¹ for measurement of 
left ventricular ejection fraction (LVEF) as the downstream task. LVEF is the fraction of the 
end-diastolic LV volume that is ejected during systole, expressed as a percentage. It is an important 
measurement for assessing cardiac function and heart failure!!*?3. Typically, LVEF measurements 
are made by tracing the LV in systolic and diastolic frames of an A4C view video. EchoNet-Dynamic 
is a ResNet-derived regression model that was trained on 144,184 videos from Stanford Healthcare (SHC). 
These videos are primarily of the apical four chamber and apical two chamber views. The model has been 
well validated on external datasets and even in a randomized clinical trial². We evaluate the performance 
of this model on synthetically produced 2D videos with simulated probe rotation, translation, and 
foreshortening to draw conclusions about its robustness in the real world and its 
dependence on view quality. 
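
As a point of reference, ejection fraction can be computed from the end-diastolic volume (EDV) and end-systolic volume (ESV) as EF = 100 × (EDV − ESV) / EDV. The sketch below is illustrative only: the function name and the example volumes are assumptions, not values from this study. It also shows why underestimating the systolic volume, as happens with foreshortening, inflates EF.

```python
def ejection_fraction(edv_ml: float, esv_ml: float) -> float:
    """Return LVEF in percent from end-diastolic and end-systolic LV volumes."""
    if edv_ml <= 0:
        raise ValueError("end-diastolic volume must be positive")
    return 100.0 * (edv_ml - esv_ml) / edv_ml


# Illustrative volumes in millilitres (hypothetical, not from the study).
true_ef = ejection_fraction(120.0, 54.0)
# A foreshortened view shrinks the apparent systolic volume, so EF rises.
foreshortened_ef = ejection_fraction(120.0, 42.0)
print(true_ef, foreshortened_ef)
```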


2. Methods 


To realize the impact of this research, several challenges had to be overcome. One of the largest 
challenges is simply working with 3D echo. To be able to make use of the 3D echo data, we first 
needed to pull the DICOM images from the hospital dataset, reverse engineer the proprietary data 
format, and develop tools for interpreting and slicing 3D volumes. The next crucial step was to align 
the 3D volumes along standard views so that they could be analyzed together. This was done using 
a view classifier that we trained just for this project. Finally, we evaluate the performance of the 
EchoNet EF prediction model. 


2.1. Working With 3D Echo 


The 3D echo dataset used in this research is a subset of all of the echos in CSMC’s database 
between 2006 and 2022, nearly 15 million images. Of these images, 1,349 are 3D 
acquisitions taken in the apical position. The apical 3D echos were used because of their ability to 
generate A4C and A2C 2D views with relatively benign artifacts from the slicing process. All 3D 
echos were captured on Philips EPIQ CVx ultrasound systems. A breakdown of the relative sizes and 
other relevant characteristics of the CSMC 2D and 3D datasets can be found in Table 1. 


Table 1. Breakdown of CSMC image types from 2006-2022 


Dataset           N Studies   N Images     Mean EF   Frame Rate   Acquisition Duration 
All Echos         369,306     14,922,383   69.10%    26.42 fps    2.75 seconds 
3D Apical Echos   1,528       1,349        56.26%    18.32 fps    2.81 seconds 


Like standard 2D echos, 3D echos are stored in DICOM format. Unlike in 2D echo, the data stored 
in the “pixel data” tag of the DICOM file is only a snapshot of the volume that the sonographer chose 
to capture and not the full 3D echo data. The full data is stored in a proprietary compressed format 
under other tags, which we were able to reverse engineer. The decompressed data consists of voxel 
data and physical bounds for the captured volume. Unlike voxel data captured in MRI and other 3D 
formats, this voxel data is not rectilinear; instead, it is defined in a spherical coordinate system, as 
shown in Fig. 1. This coordinate system is parameterized by one linear dimension (ρ) and 
two rotational dimensions (φ and θ). For each of these axes, the physical bounds given in the 
DICOM define a section of a sphere containing the scanned region, which is called the frustum. For 
convenience, we also use a 3D Cartesian coordinate system with the origin at the probe on 
the surface of the skin and the x-axis pointing parallel to the probe into the body. 


Fig. 1. Diagram showing the 3D world and spherical 
coordinate systems. 


To generate 2D slices of 3D videos, we must first define points on a plane corresponding to the 
2D view that we wish to sample. Although there are many degrees of freedom and ways to slice a 
3D volume, we decided to constrain our slices to just 4 degrees of freedom to ensure relatively 
realistic looking slices and clinical relevance. We first define a square region on the x-y plane 
centered at the center of the volume and whose width is the max width of the volume to ensure that 
any slice will be centered and reasonably zoomed. We rotate this plane around the x-axis and then 
translate it forward or backward through the volume. A translation of 1 corresponds to all the way 
forward through the volume and -1 corresponds to all the way backward through the volume. 
Translations of roughly -0.5 to 0.5 result in reasonable slices. Two additional degrees of freedom 
were added to simulate foreshortening. A horizontal axis is defined on the plane, and the slice is 
rotated forward or backward about this axis. We found that placing the axis 30% of the way from the 
top of the plane to the bottom is reasonable for simulating foreshortening in our dataset. 
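
The rotation and translation degrees of freedom described above can be sketched as follows. This is a minimal illustration, not the code used in this work: the function name and parameters are hypothetical, the translation is assumed to act along the plane normal in units of the half-width, and the two foreshortening degrees of freedom are omitted.

```python
import math


def slice_plane_points(n, center_depth, half_width, rotation_deg, translation):
    """Return an n x n grid of (x, y, z) points on a rotated, translated plane.

    The plane starts in the x-y plane (x points into the body), is rotated
    about the x-axis, then shifted along its normal by translation * half_width.
    """
    a = math.radians(rotation_deg)
    # In-plane axis (originally y) and plane normal (originally z) after rotation.
    v = (0.0, math.cos(a), math.sin(a))
    normal = (0.0, -math.sin(a), math.cos(a))
    shift = translation * half_width
    grid = []
    for i in range(n):
        # s runs along x (depth), centered on the middle of the volume.
        s = center_depth + half_width * (2.0 * i / (n - 1) - 1.0)
        row = []
        for j in range(n):
            # t runs across the plane, perpendicular to x.
            t = half_width * (2.0 * j / (n - 1) - 1.0)
            row.append((s,
                        t * v[1] + shift * normal[1],
                        t * v[2] + shift * normal[2]))
        grid.append(row)
    return grid


# Unrotated plane through the volume center, then the same plane pushed forward.
centered = slice_plane_points(3, 8.0, 6.0, 0.0, 0.0)
shifted = slice_plane_points(3, 8.0, 6.0, 0.0, 0.5)
```

With zero rotation the grid lies in the x-y plane; a translation of 0.5 moves every point to z = 0.5 × half_width under the stated assumption.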

Once we have defined the plane that we wish to slice, we define a grid of points on that 
plane, resulting in an array of shape (n, m, 3) whose last dimension contains the XYZ 
location of each point. We then transform these points into spherical coordinates using the following 
equations, resulting in an array of the same shape whose last dimension contains ρ, φ, θ. 


\[
\rho = \sqrt{x^2 + y^2 + z^2}, \qquad
\phi = \tan^{-1}\left(\frac{z}{x}\right), \qquad
\theta = \tan^{-1}\left(\frac{y}{\sqrt{x^2 + z^2}}\right)
\]

Eq. 1 
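
A minimal sketch of this coordinate change for a single point is shown below. It assumes that φ is the angle arctan(z/x) and θ is the angle arctan(y/√(x² + z²)); the function name is hypothetical, and `atan2` is used in place of a plain arctangent so that the result stays defined when x is zero.

```python
import math


def cartesian_to_spherical(x, y, z):
    """Map a Cartesian point to (rho, phi, theta)."""
    rho = math.sqrt(x * x + y * y + z * z)
    # atan2(z, x) equals arctan(z / x) for x > 0 (points in front of the probe).
    phi = math.atan2(z, x)
    # hypot(x, z) is sqrt(x^2 + z^2), which is never negative.
    theta = math.atan2(y, math.hypot(x, z))
    return rho, phi, theta


print(cartesian_to_spherical(1.0, 0.0, 1.0))
```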
Because the spherical coordinates are aligned with the voxel data, we can obtain the voxel 
indices for each point on the plane by simply renormalizing them using the volume bounds. 


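\[
i_\rho = \frac{\rho - \rho_{\min}}{\rho_{\max} - \rho_{\min}}, \qquad
i_\phi = \frac{\phi - \phi_{\min}}{\phi_{\max} - \phi_{\min}}, \qquad
i_\theta = \frac{\theta - \theta_{\min}}{\theta_{\max} - \theta_{\min}}
\]

Eq. 2 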
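
As an illustration of this renormalization, the sketch below rescales one spherical coordinate to a fractional voxel index. The function name is hypothetical, and the final scaling by (size − 1), which maps the normalized value onto an axis with `size` voxels, is an assumption about how the normalized value is turned into an array index.

```python
def to_voxel_index(value, vmin, vmax, size):
    """Rescale a coordinate from [vmin, vmax] to a fractional index in [0, size - 1]."""
    normalized = (value - vmin) / (vmax - vmin)  # 0 at vmin, 1 at vmax
    return normalized * (size - 1)


# Hypothetical radial axis: bounds of 0 to 10 (arbitrary units) sampled by 11 voxels.
print(to_voxel_index(5.0, 0.0, 10.0, 11))
```

Values outside the bounds map to indices below 0 or above size − 1, which is how out-of-volume points can be detected afterwards.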
To generate a 2D image, all we need to do is round each index to the nearest integer and look up 
its value in the voxel data. Any indices outside the bounds of the volume result in an intensity of 0. 
Although this sampling method works, the relatively low resolution of the 3D voxel data produces 
visible voxel artifacts. To mitigate this problem, we implemented 
trilinear interpolation between voxels, which results in much smoother images, as shown in Fig. 2. 
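
The difference between the two sampling schemes can be sketched in a few lines. This is an illustrative pure-Python version with hypothetical function names, operating on a nested-list volume; it is not the implementation used in this work.

```python
import math


def voxel(volume, i, j, k):
    """Return volume[i][j][k], or 0 when the index is outside the volume."""
    if 0 <= i < len(volume) and 0 <= j < len(volume[0]) and 0 <= k < len(volume[0][0]):
        return volume[i][j][k]
    return 0.0


def sample_nearest(volume, fi, fj, fk):
    """Nearest-neighbour lookup: round each fractional index to an integer."""
    return voxel(volume, math.floor(fi + 0.5), math.floor(fj + 0.5), math.floor(fk + 0.5))


def sample_trilinear(volume, fi, fj, fk):
    """Blend the 8 voxels surrounding a fractional index, weighted by proximity."""
    i0, j0, k0 = math.floor(fi), math.floor(fj), math.floor(fk)
    di, dj, dk = fi - i0, fj - j0, fk - k0
    total = 0.0
    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                weight = ((di if a else 1.0 - di) *
                          (dj if b else 1.0 - dj) *
                          (dk if c else 1.0 - dk))
                total += weight * voxel(volume, i0 + a, j0 + b, k0 + c)
    return total


# Tiny 2 x 2 x 2 test volume whose intensity is 4*i + 2*j + k.
tiny = [[[0.0, 1.0], [2.0, 3.0]], [[4.0, 5.0], [6.0, 7.0]]]
nearest_value = sample_nearest(tiny, 0.5, 0.5, 0.5)
smooth_value = sample_trilinear(tiny, 0.5, 0.5, 0.5)
```

At the center of the tiny volume the nearest-neighbour lookup snaps to a single corner voxel, while the trilinear sample returns the average of all eight corners, which is what removes the blocky appearance.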


Fig. 2. The impact of using trilinear interpolation when generating 2D slices 
from 3D echo. Panels: no interpolation; trilinear interpolation. 


2.2. View Classifier 


With the slicing algorithm that we developed, we are able to accurately simulate the motion of 
a human moving a probe around a heart. To characterize a particular view as a quantifiable 
rotation and translation away from an optimal view, however, we first need to define the optimal view. 
To do this, we trained a 2D image view classifier on a standard 2D echo dataset of known standard views. 
This dataset contains 30,045 echo videos from Stanford Healthcare (SHC) labeled as A4C, A2C, PLAX, 
Subcostal, or Other views. The breakdown of label frequencies can be found in Table 2. 
During training, random frames are selected from videos in the dataset. Because the model 
encounters images unlike anything in the training dataset when running inference on the 3D 
dataset, we attempted to increase the coverage of the training dataset by adding random mirroring 
augmentation and additional labels for mirrored A4C, A2C, PLAX and Subcostal views. We used a 
ResNet18²⁴ image classifier architecture and cross-entropy loss to train the view classifier. The view 
classifier achieved an AUC of 0.997 for both A4C and A2C views on the SHC test set. 
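
The mirroring augmentation with additional labels can be sketched as below. This is an assumption-laden illustration rather than the training code: the label names, the flip probability, and the function name are hypothetical, and a frame is represented as a plain list of pixel rows.

```python
import random

BASE_VIEWS = ["A4C", "A2C", "PLAX", "Subcostal", "Other"]
# Each standard view gets a separate "mirrored" class; "Other" does not.
MIRRORED = {v: v + "_mirrored" for v in BASE_VIEWS if v != "Other"}
LABELS = BASE_VIEWS + list(MIRRORED.values())


def mirror_augment(frame, view, rng, p=0.5):
    """Randomly flip a frame left-right and relabel it as the mirrored view."""
    if rng.random() < p:
        frame = [row[::-1] for row in frame]
        view = MIRRORED.get(view, view)  # "Other" keeps its label
    return frame, LABELS.index(view)


rng = random.Random(0)
flipped, label = mirror_augment([[1, 2, 3]], "A4C", rng, p=1.0)
```

Treating a mirrored A4C as its own class, rather than as an ordinary A4C, lets the classifier report that a slice is flipped instead of silently accepting it as a standard view.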


Table 2. Distribution of labels in the view classifier training dataset. 


View        N Total   N Train   N Val   N Test 
A4C         5,036     4,054     499     483 
A2C         3,224     2,577     318     329 
PLAX        4,059     3,239     403     417 
Subcostal   2,726     2,166     283     276 
Other       15,000    12,000    1,500   1,500 


2.3. EF Inference 


The EchoNet model we evaluated has been shown to be accurate on several datasets and even 
in a randomized clinical trial situation, but it is not known how sensitive it is to small changes in 
view quality due to poor probe placement and foreshortening. 

We addressed this problem by running inference on slices of 3D volumes while varying the 
rotation, translation and foreshortening from the ideal view. For each 3D volume, we ran both EF 
and view inference on every combination of translation from -0.5 to 0.5 and rotation from 0 to 360 
degrees. After finding the best A4C slice, we introduced foreshortening of -40 to 40 degrees to this 
view and ran EF inference again. With these results, we were able to draw conclusions about the 
performance of the EF model as a function of rotation, translation, and foreshortening from the ideal A4C view. 
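
The sweep over rotation and translation, followed by selection of the best A4C slice, can be sketched as follows. All names are hypothetical, the step sizes are arbitrary choices not given in the text, and the two models are replaced by stand-in functions so that the example runs on its own.

```python
def sweep_views(volume, view_model, ef_model, slicer, translations, rotations_deg):
    """Run view and EF inference on every (translation, rotation) combination."""
    results = []
    for t in translations:
        for r in rotations_deg:
            video = slicer(volume, rotation_deg=r, translation=t)
            results.append({"translation": t, "rotation_deg": r,
                            "a4c": view_model(video), "ef": ef_model(video)})
    return results


def best_a4c(results):
    """Pick the slice with the highest A4C activation as the reference view."""
    return max(results, key=lambda row: row["a4c"])


# Stand-ins for illustration only: the "video" is just its slicing parameters,
# and the fake view model peaks at 90 degrees of rotation and zero translation.
def stub_slicer(volume, rotation_deg, translation):
    return (rotation_deg, translation)


def stub_view_model(video):
    return -abs(video[0] - 90) - abs(video[1])


def stub_ef_model(video):
    return 60.0


grid = sweep_views(None, stub_view_model, stub_ef_model, stub_slicer,
                   [-0.5, 0.0, 0.5], range(0, 360, 30))
reference = best_a4c(grid)
```

The resulting list of rows is also the data behind a rotation-by-translation map of the kind used to inspect how predictions change across the volume.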


Fig. 3. Phase diagram showing A4C EF prediction and view activation for 
every combination of rotation and translation. The human measured EF for this 
patient is 66%. 


3. Results 


We constrained the slice degrees of freedom to rotation and translation and generated view and 
EF predictions for every combination of rotation and translation. These predictions were then plotted 
as a 2D image that summarizes how the model predictions change as the slices are rotated and 
translated, as shown in Fig. 3. In these plots, we can see that regions of high activation for A4C 
correspond to regions of more accurate EF predictions. The point of maximum A4C activation on 
this plot for each example is considered to be the optimal view for subsequent analysis. 


Fig. 4. MAE performance across dataset as view is rotated and translated. 


Fig. 5. EF Error distribution for best view, 90 degrees of 
rotation, and a translation of 0.5. 


When EF inference was run on optimal view slices, the mean absolute error (MAE) was 7.3% 
(7.0-7.7%). Although this is worse than the claimed performance of this model (6.3% comparing 
model to human or 2.8% comparing model to final value in the clinical trial)², it is consistent with 
interobserver variability and the variability between 2D and 3D echo*”>. As shown in Fig. 4, when 
we introduce either rotation or translation to the slice, the error increases. The MAE for rotation 
increases to 10.9% (10.6-11.1%), while the MAE for translation increases to 14.7% (14.6-14.9%), 
suggesting that more of the information used by the model lies near the center of the volume than 
near the edges, which are represented by slices with larger translation error. 
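
For readers who want to reproduce this kind of summary, the sketch below computes an MAE and an interval around it. The text does not state how the reported intervals were obtained; a percentile bootstrap is assumed here as one common choice, and the function names and example values are hypothetical.

```python
import random


def mae(pred, truth):
    """Mean absolute error between predicted and human-measured EF."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)


def bootstrap_mae_ci(pred, truth, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the MAE (an assumed method)."""
    rng = random.Random(seed)
    n = len(pred)
    stats = []
    for _ in range(n_boot):
        # Resample patients with replacement and recompute the MAE.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(mae([pred[i] for i in idx], [truth[i] for i in idx]))
    stats.sort()
    low = stats[int((alpha / 2) * n_boot)]
    high = stats[int((1 - alpha / 2) * n_boot) - 1]
    return low, high


# Made-up EF values in percent, for illustration only.
example_pred = [50.0, 60.0, 70.0]
example_truth = [55.0, 60.0, 62.0]
point = mae(example_pred, example_truth)
low, high = bootstrap_mae_ci(example_pred, example_truth)
```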

One characteristic we noticed was a relatively high frequency of low error, regardless of view 
quality, especially for patients with near-normal EF. This led us to hypothesize that when faced with 
a poor view, the model makes a guess near the mean of the dataset. We investigated this hypothesis 
by looking at the prediction trends in various situations. In Fig. 5, we compare the EF error 
distributions for 90-degree rotations and translations of 0.5 with those of the ideal view slices. We can 
see that when the view is near ideal, the distribution is relatively tight and centered around zero. For both 
introduced rotation and translation, the distributions are shifted to the left, corresponding 
to underestimates of EF on poorly oriented views. This underestimate cannot be explained by a 
difference in mean LVEF between the EchoNet training set and our 3D dataset: both datasets 
have mean LVEF values of roughly 55%¹. 

We also analyzed the subset of patients with human measured EF greater than 70% and the 
subset with EF less than 30%. In these subsets, the increase in MAE due to introduced rotation and 
translation is much greater, as shown in Fig. 6. This is because, for patients with extremely abnormal 
EF, predicting a value near the mean no longer yields an accurate prediction when the view is poor. 
For these patients, this effect is stronger than the tendency of the model to 
underpredict. Therefore, for patients with an LVEF < 30%, the model tends to overpredict EF when 
the view is poor. An interesting consequence of these two effects is that for low EF patients, there 
is a threshold beyond which increasing translation decreases error, because lower EF predictions are 
nearer to the human measurements for these patients. Fig. 7 illustrates how EF and A4C predictions vary with 
rotation and translation for a patient with a high human measured EF. 
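
A small worked example makes the regression-to-the-mean argument concrete. It idealizes the hypothesis by assuming the model falls back to a constant prediction of 55% on a poor view; that constant and the three patient values are hypothetical.

```python
FALLBACK_EF = 55.0  # assumed constant guess on a poor view, in percent

# Hypothetical human-measured EF values: low, near-normal, and high.
human_efs = [25.0, 60.0, 75.0]

# Signed error = prediction - human measurement (positive means overprediction).
signed_errors = {ef: FALLBACK_EF - ef for ef in human_efs}
absolute_errors = {ef: abs(err) for ef, err in signed_errors.items()}

for ef in human_efs:
    print(f"human EF {ef:.0f}%: signed error {signed_errors[ef]:+.0f}, "
          f"absolute error {absolute_errors[ef]:.0f}")
```

Under this idealization the near-normal patient incurs only a small error, while the low EF patient is strongly overpredicted and the high EF patient is underpredicted, which matches the direction of the subgroup trends described above.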


Fig. 6. EF model performance for the >70% and <30% EF subsets. 


When looking at foreshortening specifically, we might expect the model to overpredict EF if it 
calculates EF in the same way as human sonographers, but we see a similar trend as with rotation 
and translation. This suggests that when predicting EF, the AI model is not segmenting the LV and 
calculating LV volume to determine EF the way a sonographer would. Fig. 8 shows the results for 
varying foreshortening from ideal views. 


Fig. 7. Example slices and predictions for a range of (a.) rotations and (b.) 
translations for a selected example volume. 


Fig. 8. Performance figures for slices with introduced foreshortening. 


4. Discussion 


This work demonstrates how 3D echos can be used to evaluate the performance of AI models 
on realistically out-of-distribution data that these models would likely encounter in real world 
applications. Understanding distribution shifts and model performance in real world applications 
may be necessary to understand how AI truly performs in clinical practice, a major barrier in AI 
research adoption in medicine?” . We presented the methods used for interpreting and utilizing 3D 
echo data and evaluated the performance of an established AI model predicting LVEF with these 
methods. 

We found that the EF model we evaluated performed well when the ideal slice is viewed, but 
error increases as we introduce rotation, translation, or foreshortening. The overall behavior of the 
model when subjected to OOD data is to guess a value, usually a little below the mean of the training 
dataset. The overall result is a tendency to underestimate EF when the view is poor. The 
model tends to overpredict EF for patients with very low EF and underestimate EF for patients with 
very high EF. These trends extend to foreshortening, where humans would overestimate EF. 
Although it makes intuitive sense for the model to guess somewhere near the mean of the dataset 
when faced with OOD data, the mechanism causing underestimates for OOD data would require 
further investigation to explain. We hypothesize that the model is gauging the overall amount of 
motion in the heart to predict EF and for poor views there is a lack of apparent motion, thus the 
videos look more similar to ones of patients with low EF. 

The EF model performs worse even on ideal view slices from 3D echo than on 2D videos in prior 
work. There are several factors that may contribute to this error. First, 
3D echo has fundamentally lower spatial and temporal resolution. While the frame rate of standard 
2D echos is usually around 30-50 frames per second, 3D echos are much slower, in the range of 13-24 
frames per second, with higher frame rates associated with lower spatial resolution. Second, the 
3D dataset might comprise a different distribution of patients than the general population due 
to selection bias for patients needing additional 3D echos. This is likely, given that the average EF of 
the 3D dataset is 13 percentage points lower than that of the overall CSMC population. Finally, the 
view classifier we use to find the “ideal” slice is not perfect. It is trained on a dataset of 
human-acquired images that aren’t always perfect. Our classifier also has only 4 standard views, when in 
reality there are many more views, and several different views may have been grouped together under 
“A4C”. Like the EchoNet model, the view classifier was trained only on 2D images, and its performance 
on OOD 3D slices might not be reliable. This would result in the ideal slice for predicting EF not being found. 

There is significant opportunity for future research in this field with the use of 3D echo data. An 
improved view classifier would allow more accurate identification of ideal view orientation. For 
models trained on clinical 2D datasets, like the EchoNet-Dynamic dataset, it is difficult to quantify 
the amount of foreshortening and perturbation present. Future work could use 3D echo data to train 
a model that is able to predict the amount of foreshortening or perturbation in a 2D slice. This would 
allow us to retrospectively evaluate the view quality and distribution of the datasets that models are trained 
on. Additionally, with better tools to simulate and evaluate 3D distribution shifts, there is an 
opportunity to develop new data augmentations and normalization techniques addressing the spatial 
nature of echocardiology. Ultimately, as 3D echo data becomes more prevalent, future models could 
use these techniques to train on 2D slices of 3D data in addition to standard 2D views. These 
proposed methods would further our understanding and improve the robustness of AI models in 
echocardiology. 

When black box AI models are deployed in healthcare, clinicians may have no sense of whether 
a model is performing within its operating domain, which could lead to either overreliance on or 
mistrust of the AI. In this study, we show how relatively subtle changes to the input data can 
significantly impact model performance. As more AI models are integrated 
into healthcare systems, it is important to consider how the deployment environment can differ 
from the environment in which they were trained and validated. We show how identifying, simulating, and 
evaluating these hypothetical distribution shifts can lead to a better understanding of our AI systems 
and their performance in the real world. 




References 


1. Ouyang, D. et al. Video-based AI for beat-to-beat assessment of cardiac function. Nature 580, 
   252-256 (2020). 
2. He, B. et al. Blinded, randomized trial of sonographer versus AI cardiac function assessment. 
   Nature 616, 520-524 (2023). 
3. Johnson, K. W. et al. Artificial Intelligence in Cardiology. J. Am. Coll. Cardiol. 71, 2668-2679 
   (2018). 
4. Farsalinos, K. E. et al. Head-to-Head Comparison of Global Longitudinal Strain Measurements 
   among Nine Different Vendors: The EACVI/ASE Inter-Vendor Comparison Study. J. Am. Soc. 
   Echocardiogr. 28, 1171-1181, e2 (2015). 
5. Hung, J. et al. 3D echocardiography: a review of the current status and future directions. J. Am. 
   Soc. Echocardiogr. 20, 213-233 (2007). 
6. Papolos, A., Narula, J., Bavishi, C., Chaudhry, F. A. & Sengupta, P. P. U.S. Hospital Use of 
   Echocardiography: Insights From the Nationwide Inpatient Sample. J. Am. Coll. Cardiol. 67, 
   502-511 (2016). 
7. Feigenbaum, H. Evolution of echocardiography. Circulation 93, 1321-1327 (1996). 
8. Ziaeian, B. & Fonarow, G. C. Epidemiology and aetiology of heart failure. Nat. Rev. Cardiol. 
   13, 368-378 (2016). 
9. WRITING COMMITTEE MEMBERS et al. 2013 ACCF/AHA guideline for the management 
   of heart failure: a report of the American College of Cardiology Foundation/American Heart 
   Association Task Force on practice guidelines. Circulation 128, e240-327 (2013). 
10. Heidenreich, P. A. et al. Forecasting the future of cardiovascular disease in the United States: a 
    policy statement from the American Heart Association. Circulation 123, 933-944 (2011). 
11. Koh, A. S. et al. A comprehensive population-based characterization of heart failure with 
    mid-range ejection fraction. Eur. J. Heart Fail. 19, 1624-1634 (2017). 
12. Shah, K. S. et al. Heart Failure With Preserved, Borderline, and Reduced Ejection Fraction: 
    5-Year Outcomes. J. Am. Coll. Cardiol. 70, 2476-2486 (2017). 
13. Foppa, M., Duncan, B. B. & Rohde, L. E. P. Echocardiography-based left ventricular mass 
    estimation. How should we define hypertrophy? Cardiovasc. Ultrasound 3, 17 (2005). 
14. Angeli, F. et al. Day-to-day variability of electrocardiographic diagnosis of left ventricular 
    hypertrophy in hypertensive patients. Influence of electrode placement. J. Cardiovasc. Med. 7, 
    812-816 (2006). 
15. Ghorbani, A. et al. Deep learning interpretation of echocardiograms. NPJ Digit Med 3, 10 
    (2020). 
16. Poon, J., Leung, J. T. & Leung, D. Y. 3D Echo in Routine Clinical Practice - State of the Art in 
    2019. Heart Lung Circ. 28, 1400-1410 (2019). 
17. Ünlü, S. et al. Impact of apical foreshortening on deformation measurements: a report from the 
    EACVI-ASE Strain Standardization Task Force. Eur. Heart J. Cardiovasc. Imaging 21, 
    337-343 (2020). 
18. Kim, W.-J. C. et al. Automated Detection of Apical Foreshortening in Echocardiography Using 
    Statistical Shape Modelling. Ultrasound Med. Biol. 49, 1996-2005 (2023). 
19. Labs, R. B., Zolgharni, M. & Loo, J. P. Echocardiographic image quality assessment using deep 
    neural networks. arXiv [eess.IV] (2022). 
20. Poplin, R. et al. Prediction of cardiovascular risk factors from retinal fundus photographs via 
    deep learning. Nat Biomed Eng 2, 158-164 (2018). 
21. Sehwag, V. et al. Analyzing the Robustness of Open-World Machine Learning. in Proceedings 
    of the 12th ACM Workshop on Artificial Intelligence and Security 105-116 (Association for 
    Computing Machinery, 2019). 
22. Chioncel, O. et al. Epidemiology and one-year outcomes in patients with chronic heart failure 
    and preserved, mid-range and reduced ejection fraction: an analysis of the ESC Heart Failure 
    Long-Term Registry. Eur. J. Heart Fail. 19, 1574-1585 (2017). 
23. Malm, S., Frigstad, S., Sagberg, E., Larsson, H. & Skjaerpe, T. Accurate and reproducible 
    measurement of left ventricular volume and ejection fraction by contrast echocardiography: a 
    comparison with magnetic resonance imaging. J. Am. Coll. Cardiol. 44, 1030-1035 (2004). 
24. He, K., Zhang, X., Ren, S. & Sun, J. Deep Residual Learning for Image Recognition. arXiv 
    [cs.CV] (2015). 
25. Yuan, N. et al. Systematic Quantification of Sources of Variation in Ejection Fraction 
    Calculation Using Deep Learning. JACC Cardiovasc. Imaging. 
26. Gerke, S., Babic, B., Evgeniou, T. & Cohen, I. G. The need for a system view to regulate 
    artificial intelligence/machine learning-based software as medical device. NPJ Digit Med 3, 53 
    (2020). 
27. Wu, E. et al. How medical AI devices are evaluated: limitations and recommendations from an 
    analysis of FDA approvals. Nat. Med. 27, 582-584 (2021). 



BrainSTEAM: A Practical Pipeline for Connectome-based fMRI Analysis 
towards Subject Classification 


Alexis Li 


Hamilton High School, 
E-mail: li.aleris1111@gmail.com 


Yi Yang 


Duke University, 
E-mail: owen.yang@duke.edu 


Hejie Cui 
Department of Computer Science, Emory University, 
E-mail: hejie.cui@emory.edu 


Carl Yang 


Department of Computer Science, Emory University, 
E-mail: j.carlyang@emory.edu 


Functional brain networks represent dynamic and complex interactions among anatomical 
regions of interest (ROIs), providing crucial clinical insights for neural pattern discovery 
and disorder diagnosis. In recent years, graph neural networks (GNNs) have proven immensely 
successful and effective in analyzing structured network data. However, because the 
high complexity of data acquisition limits the training resources available for neuroimaging 
data, GNNs, like all deep learning models, suffer from overfitting. Moreover, their capability 
to capture useful neural patterns for downstream prediction is also adversely affected. To 
address these challenges, this study proposes BrainSTEAM, an integrated framework featuring 
a spatio-temporal module that consists of an EdgeConv GNN model, an autoencoder 
network, and a Mixup strategy. In particular, the spatio-temporal module dynamically 
segments the time series signals of the ROI features for each subject into chunked 
sequences. We leverage each sequence to construct correlation networks, thereby increasing 
the training data. Additionally, we employ the EdgeConv GNN to capture ROI connectivity 
structures, an autoencoder for data denoising, and Mixup for enhancing model training 
through linear data augmentation. We evaluate our framework on two real-world neuroimaging 
datasets, ABIDE for autism prediction and HCP for gender prediction. Extensive experiments 
demonstrate the superiority and robustness of BrainSTEAM when compared to 
a variety of existing models, showcasing the strong potential of our proposed mechanisms 
in generalizing to other studies for connectome-based fMRI analysis. 


Keywords: Brain Connectome Analysis; Neuroimaging Studies; Synthetic Data Generation 
1. Introduction


Functional brain networks illustrate the dynamic connectivity patterns between anatomical
regions of interest (ROIs) for different cognitive states and different responses to disease or
injury.¹ The study of functional brain networks provides insights into the underlying mechanisms
of human consciousness, developmental processes, and the neural bases of various neurological
and psychiatric disorders such as autism, ADHD, depression, and schizophrenia.² However,
existing computational tools often extract a single static graph structure based on correlations
among full BOLD signals, which ignores the dynamic changes of functional connectivity.³⁻⁵

Compared with other deep learning paradigms such as Convolutional Neural Networks
(CNNs)⁶ and Recurrent Neural Networks (RNNs),⁷ Graph Neural Networks (GNNs)⁸,⁹ provide
unique benefits in functional brain network analysis due to their capability in modeling
connectivity structures.¹⁰⁻¹⁷ However, most GNN-based frameworks resort to static correlation
networks as data instances, and they are prone to unstable performance due to heavy
noise in the BOLD signals and to overfitting due to the limited number of labeled clinical outcomes.
This is especially true for the ABIDE dataset, as the images come from 17 international sites
with differing imaging protocols and the dataset itself is heterogeneous.¹⁸

To address the challenges above, this study proposes BrainSTEAM, an integrated pipeline 
that features a spatio-temporal module, for brain connectome analysis on dynamic fMRI 
networks. Specifically, we propose a temporal chunking approach to dynamically segment the 
BOLD signals of each subject into partitioned sequences based on a tunable sliding window 
to capture the local connectivity structures at different scales, which are further modeled 
by EdgeConv. An autoencoder is devised to discover the important connectivity patterns 
during ROI pooling through learnable dropout, where the objective is to reconstruct the full 
connectivity patterns only based on the important ones. Mixup is applied to further stabilize 
and enhance training of the whole framework through linear data augmentation to prevent 
the model from memorizing certain data points. 

Extensive experiments conducted in this study demonstrate that our BrainSTEAM model 
outperforms state-of-the-art models on both mental disorder prediction and gender classifica- 
tion, indicating its effectiveness in modeling functional brain networks and also highlighting 
its flexibility and versatility. It is also promising to apply BrainSTEAM to the analysis of 
functional brain networks for other clinical applications, as well as other dynamic graphs ex- 
tracted from time-series data. For clinical applications in particular, this model would help to 
address the limitations of MRI data collection as there are limited scans due to the expensive 
nature of MRIs and constant exposure for the patient. Decreasing information loss can make 
the model more robust, providing a more reliable aid for those in a clinical setting. 


2. Related Work 
2.1. Data Augmentation 


Mixup applies the principle of vicinal risk minimization across different classes, constructing
new data as a combination of existing data points.¹⁹ Graph Mixup techniques often create
synthetic graphs by sampling connected subgraphs or reordering the original graph structure.
Previous works such as G-Mixup use probability matrices to predict whether an edge exists between
two nodes, and Graph Transplant samples the top nodes in a graph and then appends a partial
K-hop subgraph to predict edges.²⁰,²¹ However, sampling subgraphs and appending them to
the original graph becomes problematic when considering the fixed nature of brain ROIs.
Additionally, previous studies have used temporal augmentation techniques to improve
model generalization. STDAC proposed a module that uses a random discontinuous sampling
period, with a tensor fusion method to combine it with the spatial model.²² Multi-Head
GAGNN modeled both the spatial and temporal patterns of functional brain networks simultaneously to
fully utilize their characteristics.²³ These methods are still often limited by small sample sizes;
thus, there is potential in combining a spatio-temporal data augmentation technique with mixup.


2.2. Graph Pooling 


Previous graph pooling methods use hierarchical graph clustering, following the
principle of local neighborhoods of nodes.²⁴ This has been extended to deterministic clustering
algorithms and attention-based mechanisms to improve the quality of cluster
assignment.²⁵,²⁶ Other methods include node drop pooling, which decreases the time and space required
by simply selecting a subset of nodes to construct the coarsened graph. Traditional
pooling methods include selecting the top-k nodes, using self-attention networks, and
a gated structure-aware approach.²⁷⁻²⁹ Yet, these methods are also limited by small sample
sizes and are prone to focusing on local structures rather than the graph as a whole.


3. The Proposed Model 


3.1. Capturing Dynamic Connectivity via Temporal Chunking and 
EdgeConv Analysis 


We define a directed graph as G = {V, E} for each brain network subject, where V is the set of
nodes, each with a time series, and E represents the weighted connectivity. Temporal chunking
looks at a smaller window of the subject's time series at any one point
in time rather than aggregating the series as a whole. Window sizes of 50, 64, and 128 are considered, and
the starting points of the windows are randomly generated for each epoch. For each generated
window, the partial correlation matrix is extracted to form the adjacency matrix. This
dramatically increases the variety of the data the model has to work with, allowing for a more
robust model at the end of training. It helps to combat the overfitting that previous
models have cited as a limitation. The model is relevant for clinical use as it can better adapt
to the small sample sizes that are commonly seen in MRI datasets and to
new patients as well. Its novelty lies in its integration with BrainGMixup, which ensures maximal
data variation by accounting for both spatial and temporal data augmentation.
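
The temporal chunking step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `random_chunk` and `chunk_to_adjacency` are hypothetical, the time series is synthetic noise with the ABIDE shape (196 x 39), and plain Pearson correlation stands in for the partial correlation used in the paper.

```python
import numpy as np

def random_chunk(ts, window, rng):
    """Cut one contiguous window of `window` frames from a (T, N) time series."""
    n_frames = ts.shape[0]
    # Random start chosen so that the whole window fits inside the recording.
    start = int(rng.integers(0, n_frames - window + 1))
    return ts[start:start + window]

def chunk_to_adjacency(chunk):
    """Correlation between the ROI columns of a (window, N) chunk -> (N, N)."""
    # Assumption: Pearson correlation as a stand-in for partial correlation.
    return np.corrcoef(chunk, rowvar=False)

rng = np.random.default_rng(0)
ts = rng.standard_normal((196, 39))   # synthetic stand-in, not real BOLD data
# In training, a fresh window would be drawn every epoch; here we draw three.
graphs = [chunk_to_adjacency(random_chunk(ts, 128, rng)) for _ in range(3)]
print(len(graphs), graphs[0].shape)
```

Because the start point is redrawn on every call, the same subject yields a different correlation graph each time, which is the source of the extra training variety.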

Edge features are defined as $e_{ij} = h_\Theta(x_i, x_j)$, where $h_\Theta: \mathbb{R}^F \times \mathbb{R}^F \to \mathbb{R}^{F'}$ is the MLP of the
model, a nonlinear function parameterized by a set of learnable parameters $\Theta$. $x_i$ represents
the embedding of node $i$ and $x_j$ represents the embeddings of all the neighbors of node $i$,
including the node itself. In this case, a sum aggregation operation is performed over all the
edge features to get the final embedding for the node and its neighbors' edges, represented by
the following equation:

$$x_i' = \sum_{j:(i,j)\in E} h_\Theta(x_i, x_j) \qquad (1)$$

Fig. 1: Overview of the proposed BrainSTEAM architecture.

EdgeConv³⁰ allows for the extraction of neighborhood-level features within the overall topological
structure of the network. Different aggregation methods can be used across the embeddings
of the node and its neighbors. The graph is also dynamically updated by computing the pairwise
distance matrix of the node features and selecting the k nearest neighbors of each node,
where k is a hyperparameter that can be varied to obtain desirable results.
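
As a rough sketch of the EdgeConv update described above (an MLP applied to each node-neighbor pair, followed by sum aggregation over the k nearest neighbors), the NumPy code below may help. It makes several assumptions that are not taken from the paper: the MLP is reduced to one linear layer with ReLU acting on the concatenation of the two embeddings, the weights are random, and the function names `knn_indices` and `edge_conv` are hypothetical.

```python
import numpy as np

def knn_indices(x, k):
    """Indices of the k nearest neighbors of every node (self included)."""
    # Pairwise squared Euclidean distances between node embeddings, (N, N).
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    return np.argsort(d2, axis=1)[:, :k]

def edge_conv(x, w, b, k):
    """One EdgeConv-style layer with sum aggregation.

    x: (N, F) node embeddings, w: (2F, F_out) weights, b: (F_out,) bias.
    """
    idx = knn_indices(x, k)                     # (N, k) graph rebuilt from features
    xi = np.repeat(x[:, None, :], k, axis=1)    # (N, k, F) center node, repeated
    xj = x[idx]                                 # (N, k, F) neighbor embeddings
    # Assumed edge function: ReLU(W [x_i ; x_j] + b), a one-layer stand-in MLP.
    e = np.maximum(np.concatenate([xi, xj], axis=-1) @ w + b, 0.0)
    return e.sum(axis=1)                        # sum over the k edge features

rng = np.random.default_rng(0)
x = rng.standard_normal((22, 8))                # 22 ROIs, 8 features (illustrative)
w = rng.standard_normal((16, 4))
b = np.zeros(4)
out = edge_conv(x, w, b, k=10)
print(out.shape)
```

Recomputing `knn_indices` from the current embeddings at every layer is what makes the graph "dynamic" in this sketch.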


3.2. Discovering Important Connectivity via Autoencoder-based Pooling 


Graph pooling is a key component for compressing the predictions of multiple nodes into a graph-level
classification. To discover important connectivity, we adapt the Graph Autoencoder³¹ technique,
in which node dropping is performed to measure the importance of each node for
reconstructing the topological structure without labels. The new graph generated by
pooling can be defined as:


$$G' = \mathrm{POOL}(G), \qquad (2)$$


where the pooling method SAGPool³¹ acts as the encoder of the autoencoder. SAGPool
first generates scores for all the nodes by graph convolution and performs pooling by keeping only the
top-k scoring nodes, with the pooling ratio determined by the hyperparameter k. Those nodes
are then used to compose a new coarsened graph through the attribute and adjacency
matrices:


$$Z^{(l+1)} = Z^{(l)}_{(\mathrm{idx},:)} \odot S^{(l)}_{(\mathrm{idx})} \in \mathbb{R}^{n^{(l+1)} \times d}, \qquad (3)$$

$$A^{(l+1)} = A^{(l)}_{(\mathrm{idx},\mathrm{idx})} \in \{0,1\}^{n^{(l+1)} \times n^{(l+1)}},$$

where idx serves as the indexing operator for the top-k scoring nodes, $Z^{(l)}_{(\mathrm{idx},:)}$ is
the row-wise indexed embedding matrix, $\odot$ is the broadcast elementwise product, and
$A^{(l)}_{(\mathrm{idx},\mathrm{idx})}$ is the row-wise and column-wise indexed adjacency matrix. $Z^{(l+1)}$ and $A^{(l+1)}$ are
respectively the new attribute and adjacency matrices, with $n^{(l+1)}$ the number of nodes kept
and $d$ the embedding dimension. $S^{(l)}_{(\mathrm{idx})}$ represents the score matrix of
the top-k selected nodes at layer l. The score matrix is calculated by inputting the adjacency
matrix and node embedding matrix at layer l into a Graph Convolution Network (GCN).

The decoder reconstructs the embeddings of the dropped nodes: an empty attribute matrix is
created and filled with the pooled node embeddings, with zero padding for the dropped nodes,
to form a new embedding matrix. To measure the validity of this
reconstructed matrix, the Euclidean distance is calculated between the reconstructed attribute
matrix and the original input matrix. This becomes the loss function $L_f$, shown below:


$$L_f^{(l)} = \left\| X - W^{(l)} \right\|_F^2 \qquad (4)$$


$$L_d^{(l)} = \left\| D - \hat{D}^{(l)} \right\|_F^2 \qquad (5)$$

where $L_f^{(l)}$ represents the loss of the node attributes for the $l$-th layer, $\|\cdot\|_F$ is the Frobenius
norm, and $X$ represents the node feature matrix. An additional $L_d$ is adopted to regularize
the distance between the true degree values $D$ and the reconstructed ones $\hat{D}^{(l)}$. This determines how
close the pooling mechanism's reconstruction came to the original graph of the subject. $W^{(l)}$
represents the reconstructed node attribute matrix. This method of pooling is preferable
to the typical mean, max, or summation pooling as it identifies the most structurally important
nodes and reduces the number of noisy nodes, allowing for more focused analysis.
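
The pooling and reconstruction steps of this subsection can be illustrated with a small NumPy sketch. It is only a sketch under assumptions: the scoring step is simplified to `tanh(A X theta)` with a random `theta` rather than a trained GCN, the decoder is reduced to the zero-padding step with no learned layers, and the names `topk_pool`, `zero_pad_reconstruct`, and `feature_loss` are hypothetical.

```python
import numpy as np

def topk_pool(x, a, theta, k):
    """Score nodes with one graph-convolution-like step and keep the top k."""
    score = np.tanh(a @ x @ theta)        # (n,) one score per node
    idx = np.argsort(-score)[:k]          # indices of the k highest scores
    z = x[idx] * score[idx, None]         # row-indexed features gated by score
    a_new = a[np.ix_(idx, idx)]           # row- and column-indexed adjacency
    return z, a_new, idx

def zero_pad_reconstruct(z, idx, n):
    """Place pooled embeddings back at their rows; dropped rows stay zero."""
    w = np.zeros((n, z.shape[1]))
    w[idx] = z
    return w

def feature_loss(x, w):
    """Squared Frobenius distance between original and reconstructed features."""
    return float(np.linalg.norm(x - w, ord="fro") ** 2)

rng = np.random.default_rng(0)
n, d, k = 22, 8, 10
x = rng.standard_normal((n, d))
a = (rng.random((n, n)) < 0.3).astype(float)
a = np.maximum(a, a.T)                    # symmetric binary adjacency
theta = rng.standard_normal(d)

z, a_new, idx = topk_pool(x, a, theta, k)
w = zero_pad_reconstruct(z, idx, n)
loss = feature_loss(x, w)
print(z.shape, a_new.shape, loss)
```

In a trained model the loss would be backpropagated so that the scoring step learns to keep the nodes from which the full graph is easiest to reconstruct.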


3.3. Enhancing Model Training and Stability via Mixup 


It is difficult for GNNs to properly analyze the underlying signals in functional brain images
because they overfit and memorize noise in specific training data.³² Vicinal risk
minimization,³³ rather than empirical risk minimization,³³ has been applied to improve
generalization capability. Vicinal risk minimization refers to creating virtual training
examples from the neighborhood of existing training data.

This paper proposes BrainGMixup,³⁴ which operates on the 2D node and
edge feature matrices of a graph rather than the 1D feature vectors used for other forms of data, such as the inputs to CNNs.
This requires interpolation between the rows/ROIs of the graph rather than between
individual feature columns. This differs from traditional Graph Mixup approaches as it involves
interpolation instead of concatenation of smaller subgraphs. Mixup takes two
subjects and combines their feature and edge index information to create a new graph for the
model to train on. The fixed nature of ROIs in the data allows the mixup to be applied
across rows for the node feature matrix and edge index matrix,


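$$\tilde{X} = \lambda X_i + (1-\lambda) X_j, \quad \text{where } i,j = 1,\dots,N,\ i \neq j,$$
$$\tilde{E} = \lambda E_i + (1-\lambda) E_j, \quad \text{where } i,j = 1,\dots,N,\ i \neq j, \qquad (6)$$


$$\tilde{y} = \lambda y_i + (1-\lambda) y_j. \qquad (7)$$


N represents the number of ROIs defined in the node feature matrix, E is the edge index
matrix, and y is the corresponding label. $\tilde{X}$, $\tilde{y}$, and $\tilde{E}$ are the mixup-augmented samples of
the corresponding matrices. Mixup interpolates between the existing graph samples to
cover in-between brain network variations:


$$L_m = \lambda \cdot c(p, y_a) + (1-\lambda) \cdot c(p, y_b), \qquad (8)$$

where $p$ is the model prediction for the mixed sample, $c$ is the classification criterion, and
$y_a$ and $y_b$ are the labels of the two mixed subjects ($y_i$ and $y_j$ in Eq. 7).


Fig. 2: The process of brain network construction.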


The mixup loss criterion $L_m$ utilizes the vicinity distribution³³ to estimate the chance that a
particular feature-target pair lies in the vicinity of a given graph. The differing
λ values ensure that even if data are taken at a similar time point, the resulting graphs
will not be the same. The hyperparameter α is used to determine the degree of interpolation
between the different ROIs and edge connectivities. This serves as an efficient and effective
way of accounting for the heterogeneous, scarce, and noisy nature of brain networks. The total
loss is $L_{all} = \alpha L_f + \beta L_d + L_m$, with α and β serving as hyperparameters that
determine the weights of the feature loss and degree loss. Mixup provides an additional increase
in effective sample size alongside temporal chunking, giving the model more training samples.
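
The interpolation, the mixup loss, and the weighted total loss of this subsection can be sketched in a few lines of NumPy. The sketch rests on assumptions that go beyond the text: the edge information is treated as a dense weighted matrix, the criterion c is taken to be cross-entropy on a probability vector, λ is drawn from a Beta distribution as in standard mixup, and all function names are hypothetical.

```python
import numpy as np

def mixup_graphs(x_i, e_i, y_i, x_j, e_j, y_j, lam):
    """Linear interpolation of two subjects' features, edges, and labels."""
    x_mix = lam * x_i + (1.0 - lam) * x_j
    e_mix = lam * e_i + (1.0 - lam) * e_j
    y_mix = lam * y_i + (1.0 - lam) * y_j
    return x_mix, e_mix, y_mix

def cross_entropy(p, y):
    """Assumed criterion c(p, y): p is a probability vector, y a class index."""
    return float(-np.log(p[y]))

def mixup_loss(p, y_a, y_b, lam):
    """Mixup criterion: the same prediction scored against both source labels."""
    return lam * cross_entropy(p, y_a) + (1.0 - lam) * cross_entropy(p, y_b)

def total_loss(l_f, l_d, l_m, alpha, beta):
    """Weighted sum of feature loss, degree loss, and mixup loss."""
    return alpha * l_f + beta * l_d + l_m

rng = np.random.default_rng(0)
lam = float(rng.beta(1.0, 1.0))            # assumption: lambda ~ Beta(1, 1)
x_i, x_j = rng.standard_normal((2, 39, 39))
e_i, e_j = rng.standard_normal((2, 39, 39))
y_i, y_j = np.array([1.0, 0.0]), np.array([0.0, 1.0])   # one-hot labels
x_mix, e_mix, y_mix = mixup_graphs(x_i, e_i, y_i, x_j, e_j, y_j, lam)
p = np.array([0.7, 0.3])                   # illustrative model output
print(lam, y_mix, mixup_loss(p, 0, 1, lam))
```

Interpolating whole matrices is only meaningful here because every subject shares the same fixed set of ROIs, so row k of one subject corresponds to row k of the other.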


4. Experiments 


Datasets. We evaluate our framework using two publicly available real-world neuroimaging
datasets, the Autism Brain Imaging Data Exchange (ABIDE)³⁵ on the ASD prediction task
and the Human Connectome Project (HCP)³⁶ on the gender classification task. The CPAC³⁷
preprocessed ABIDE dataset is a collection of 4D resting-state functional MRI scans from a
total of 1,112 individuals, 539 with Autism Spectrum Disorder (ASD) and 573 typical healthy
controls. The preprocessed HCP dataset, on the other hand, is a large-scale dataset that
includes resting-state fMRI scans from 1,095 subjects (595 females
and 500 males), with about 1,200 frames in each scan.
Brain network construction. The brain Blood Oxygenation Level Dependent (BOLD)
signal time series is then extracted from the fMRI data using the MSDL probabilistic
atlas, which defines a soft parcellation of the brain into 39 ROIs, on ABIDE and
CIFTI (Connectivity Informatics Technology Initiative) with 22 ROIs on HCP. This produces an ABIDE
time series matrix (196 x 39) and an HCP time series matrix (1200 x 22). The numbers of ROIs were
chosen based on previous experimentation and papers on the appropriate number for each
condition.³⁸

Then, the brain connectivity matrices among ROIs are calculated from the time series data
as partial correlation and correlation matrices, followed by z-score normalization. A non-zero
entry in an adjacency matrix means that a pair of ROI nodes share an edge, and its value
indicates the edge weight between the nodes. The sparse partial correlation matrix can help
to avoid the over-smoothing issue commonly seen in GNN applications. Node features are
initialized with the corresponding rows of the edge weight matrix.
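
One common way to obtain a partial correlation matrix is through the inverse covariance (precision) matrix, and the sketch below uses that route. It is an assumption that this matches the paper's computation: the time series is synthetic with the HCP shape (1200 x 22), the z-score is applied to the matrix entries as one possible reading of the normalization step, and the function names are hypothetical.

```python
import numpy as np

def partial_correlation(ts):
    """Partial correlation between the columns of a (T, N) time series."""
    cov = np.cov(ts, rowvar=False)
    prec = np.linalg.pinv(cov)              # precision (inverse covariance) matrix
    d = np.sqrt(np.diag(prec))
    pcorr = -prec / np.outer(d, d)          # rho_ij = -P_ij / sqrt(P_ii * P_jj)
    np.fill_diagonal(pcorr, 1.0)
    return pcorr

def zscore(m):
    """Standardize all entries of a matrix to zero mean and unit variance."""
    return (m - m.mean()) / m.std()

rng = np.random.default_rng(0)
ts = rng.standard_normal((1200, 22))        # synthetic stand-in, not real BOLD data
adj = partial_correlation(ts)               # weighted adjacency matrix
node_features = zscore(adj)                 # row k is the feature vector of ROI k
print(adj.shape, node_features.shape)
```

Using each row of the connectivity matrix as that node's feature vector gives every ROI a connection profile of length N, which is the initialization described above.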

The temporal windows/chunks of each subject are constructed as graph representation
objects, as seen in the third step of Figure 2. Each graph representation object then goes
through mixup, via the BrainGMixup module, to create a new graph that is an interpolation of
two different subjects. This data is then fed into the EdgeConv model for training, and pooling is
conducted by the autoencoder.

The EdgeConv model contains three blocks, each containing a dynamic EdgeConv
layer, a batch normalization layer, and a ReLU activation layer. Each block also includes
the feature decoder and degree decoder layers as part of the autoencoder module. The loss
is calculated as described in the methods section, with different weights applied to the model
loss and to the autoencoder losses for the reconstructed features and degrees relative
to their ground truth values.


Baselines. We compare our proposed BrainSTEAM with the baseline models MAGE,?? SVM-
MTFS,4° MISO-DNN,4! e-STAGIN,*! MAGIN,*2 IMAGIN® on the ABIDE dataset, and with
ST-GCN,?8 LSTM,! GCN,’ GC-LSTM,“3 STAGIN-SERO“ and DECENNT®® on the HCP
dataset.


Experimental settings. This study performs training and testing with 5-fold cross-validation
and dynamically constructs a graph data object for each sub-sequence; among the window sizes
tested, the optimum W is fixed at 128. The learning rate is set to $10^{-4}$, with 10,000 epochs for ABIDE
and 30,000 for HCP. All reported results are averaged over five runs of five-fold cross-validation.
Additional details regarding the experiment settings can be found in the supplementary
materials.
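
The evaluation protocol above (five runs of 5-fold cross-validation, with results averaged) can be sketched as follows. This is an illustrative outline only: `k_fold_indices` and `repeated_cv_mean` are hypothetical helpers, the split is a plain shuffled split with no stratification or site handling, and `score_fn` stands in for training and evaluating a model on one fold.

```python
import numpy as np

def k_fold_indices(n_samples, n_folds, rng):
    """Yield (train_idx, test_idx) pairs for one shuffled k-fold split."""
    perm = rng.permutation(n_samples)
    folds = np.array_split(perm, n_folds)
    for i in range(n_folds):
        test = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, test

def repeated_cv_mean(score_fn, n_samples, n_folds=5, n_runs=5, seed=0):
    """Mean and standard deviation of a score over repeated k-fold CV."""
    rng = np.random.default_rng(seed)
    scores = [score_fn(train, test)
              for _ in range(n_runs)
              for train, test in k_fold_indices(n_samples, n_folds, rng)]
    return float(np.mean(scores)), float(np.std(scores))

# Placeholder score: fraction of subjects held out in each fold.
mean, std = repeated_cv_mean(lambda tr, te: len(te) / (len(tr) + len(te)),
                             n_samples=1112)
print(mean, std)
```

Splitting by subject before any temporal chunking keeps all windows of one subject on the same side of the train/test boundary, which is the assumption this sketch makes about how chunks and folds interact.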


Prediction performance. The overall prediction results presented in Table 1 show that
BrainSTEAM outperforms the baseline model MAGIN by 9.38% and IMAGIN by 8.25% in accuracy on the
ABIDE dataset, and achieves a 7.71% improvement over ST-GCN and a 3.21% improvement
over STAGIN-SERO on the HCP dataset. The results demonstrate the superiority of BrainSTEAM
in neuropsychiatric disorder prediction and gender classification compared to other
state-of-the-art models.
ABIDE
Model          Accuracy      AUC           Precision     Recall
MAGE           75.86         83.14         71.53         79.24
SVM+MTFS       76.7227       8120.51       72.545.2      76.7227
MISO-DNN       77.73±4.26    -             76.7340       77.164s.72
e-STAGIN(Sch   75.81±1.70    81.12±0.30    78.03±2.34    79.06±0.89
MAGIN          78.12±1.91    85.72402      78.3722.      79.55±1.02
IMAGIN         79.25±2.33    86.44.2022    81.03±3.47    79.060.s0
BrainSTEAM     87.50.09      89.23±0.88    82.242248     96.11±2.47

HCP
Model          Accuracy      AUC           Precision     Recall
ST-GCN         83.7          -             -             -
LSTM           81.7          -             -             -
GCN            83.98         -             84.59         87.78
GC-LSTM        81.50         -             -             -
STAGIN-SERO    88.20±1.33    92.96±1.87    -             -
DECENNT        86.00         93.6          87.2          88.6
BrainSTEAM     91.41±0.02    93.67±0.01    100±0.00      78.78±0.04


Table 1: Overall performance (%) comparison on two datasets. Results with - were not provided
in the original work.


ABIDE
Model          Accuracy      AUC           Precision     Recall
BrainSTEAM     87.50.99      89.23±0.88    82.24±0.48    96.11±2.47
BrainEAM       62.86±0.87    62.36±0.78    67.23±0.09    63.95±1.70
BrainEM        63.66±1.45    62.50±1.66    68.09±2.14    71.24±1.60
BrainE         59.43±1.48    59.24±1.64    60.22±0.61    63.9841.

HCP
Model          Accuracy      AUC           Precision     Recall
BrainSTEAM     91.41±0.02    93.67±0.01    100±0.00      78.78±0.04
BrainEAM       77.20±1.35    80.15±2.27    87.43±4.49    66.59±5.2
BrainEM        74.42±0.01    74.48±0.01    77.1140.      73.79±0.01
BrainE         67.85±0.01    68.01±0.01    68.97±0.01    70.46±0.02


Table 2: The ablation study with different model variants: BrainSTEAM is the full version
with all components, BrainEAM removes the temporal chunking, BrainEM removes both the
temporal chunking and autoencoder, and BrainE is only equipped with EdgeConv.


We further investigate the influence of each proposed component by removing them one at a
time. The results are shown in Table 2. The temporal module contributes
the greatest increase in prediction accuracy, improving it by about 23.84% on
ABIDE and about 14.21% on HCP. The autoencoder module provides more stability to the
network, as seen in the decrease in the standard deviation.

Key hyperparameter studies are shown in Fig. 3: (a) shows that performance is about 1.87%
higher when k=10 than when k=15; (b) shows that performance is about 3.12% higher when window=128
than when window=50; (c) shows that performance is 7.03% higher when loss alpha and loss beta are set
to 0.3 rather than 0.1; (d) and (e) show that performance increases dramatically for BrainSTEAM
as the number of epochs increases from 1k to 5k to 10k (ABIDE) or 30k (HCP), whereas performance
stays flat for BrainEAM over the same range of epochs on both ABIDE and HCP.


5. Interpretation Analysis 


As summarized above, the proposed BrainSTEAM is shown to significantly outperform the baseline
models. We claim the fundamental reason is that the baselines obtain only one
graph from the full time series of each subject, resulting in only 1,112 graphs for ABIDE
and 1,095 graphs for HCP. With our proposed temporal chunking combined with
mixup, a far larger number of new graphs can be generated. Specifically, the
model is trained on the same 1000 subjects, but generating time series chunks under 5-fold
cross-validation for 30,000 epochs leads to 150,000 different graphs. Hyperparameter tuning
over epochs reveals that the BrainEAM model reaches a training accuracy of 99% within 200 epochs,
but its validation accuracy stays in the low 60s, a typical sign of overfitting. When
temporal chunking is applied, the training and validation accuracy scale more evenly, with an
18.13% increase in validation accuracy as the number of epochs increases from 1,000 to 10,000.
The temporal chunking results visualized in Fig. 4 demonstrate that graphs generated at
different time windows have significantly different levels of connectivity between ROIs. This
holds true for both healthy controls and patients diagnosed with autism. The level of interaction
for the healthy control is far more pronounced, as shown by the greater number of deep red
boxes, than for the autism patient. This fine-grained interaction difference is not expressed
in the graph generated from the entire time series. This demonstrates that
the proposed temporal chunking method is better able to capture time-specific interactions in
the brain and will produce more robust, generalizable patterns.

The model outperforms ST-GCN, demonstrating that a temporal module alone might
not be comprehensive enough to cover all the issues that create overfitting and accuracy
deficits. A combination of retaining connectivity information and performing self-supervised
node dropping is needed to create the most robust version of the model.


Fig. 3: The hyperparameter study for BrainSTEAM on the ABIDE dataset. Panels: (a) K-neighboring
nodes; (b) Window Size; (c) Loss_alpha, loss_beta; (d) ABIDE Epoch Comparison; (e) HCP Epoch
Comparison; (f) Model Ablation Comparison.


Fig. 4: The visualization of brain connectome, where the subfigures (a) & (e) represent the
connectome of ABIDE Health Control (HC) with the full sequence of time series; (b) & (f)
represent that of Autism; (c) & (g) represent HC with time series subsequence of window size
50; and (d) & (h) represent Autism with time series subsequence of window size 50.


6. Conclusion 


This study proposes a dynamic functional brain network analysis framework BrainSTEAM, 
which integrates the temporal sliding window module with EdgeConv, Autoencoder and Mixup 
for the first time. Extensive experiments on two real-world neuroimaging datasets exhibit sig- 
nificant performance improvement over the state-of-the-art. This study also shows the con- 
tribution of each component to the system, demonstrating the temporal chunking approach 
as the major contributor to performance improvement, which allows for the representation 
of functional brain connectivity within smaller time windows to capture unique fine-grained 
ROI interactions. In the meantime, the study also shows EdgeConv helps in capturing the 
connectivity structures of the brain networks, autoencoder helps in reducing data noise and 
identifying the most relevant connectivity patterns, and mixup helps in enhancing the model 
training through linear interpolation. For future work, we look to improve BrainSTEAM with 
explainability, such as identifying meaningful biomarkers linked to neuropsychiatric disorders 
and mental development, understanding which neural systems contribute most to the predic- 
tion of a specific disease, applying the model to other datasets and tasks, and exploring its 
potential applications in clinical settings. 
Topological data analysis (TDA) combined with machine learning (ML) algorithms is a powerful
approach for investigating complex brain interaction patterns in neurological disorders such as epilepsy. 
However, the use of ML algorithms and TDA for analysis of aberrant brain interactions requires 
substantial domain knowledge in computing as well as pure mathematics. To lower the threshold for 
clinical and computational neuroscience researchers to effectively use ML algorithms together with TDA 
to study neurological disorders, we introduce an integrated web platform called MaTiLDA. MaTiLDA 
is the first tool that enables users to intuitively use TDA methods together with ML models to 
characterize interaction patterns derived from  neurophysiological signal data such as 
electroencephalogram (EEG) recorded during routine clinical practice. MaTiLDA features support for 
TDA methods, such as persistent homology, that enable classification of signal data using ML models 
to provide insights into complex brain interaction patterns in neurological disorders. We demonstrate the 
practical use of MaTiLDA by analyzing high-resolution intracranial EEG from refractory epilepsy 
patients to characterize the distinct phases of seizure propagation to different brain regions. The 
MaTiLDA platform is available at: https://bmhinformatics.case.edu/nicworkflow/MaTiLDA 
1. Introduction


The increasing availability of multimodal brain activity recordings highlights an emergent demand for
accurate and reliable analytical methods to characterize brain interaction dynamics to meet clinical
research goals and to improve patient care¹. The analysis of brain recordings provides insights into the
dynamics of interaction patterns involving specialized brain regions that may be responsible for
higher-order brain functions². Understanding disruptions in brain interaction patterns is crucial to
characterizing neurological disorders, revealing pathophysiological mechanisms, and defining
biomarkers for clinical diagnoses³. These research goals are particularly important in epilepsy, which
is a complex neurological disorder affecting over 50 million individuals worldwide⁴. Epilepsy is
characterized by recurrent seizures stemming from abnormal electrical discharges that spread
throughout the brain⁴. Similar to other disease domains, there has been a rapid increase in the use of
machine learning (ML) algorithms to study brain interaction dynamics in epilepsy patients⁵,⁶. ML
algorithms such as support vector machines (SVM) have used features extracted from
neurophysiological signal data, such as electroencephalogram (EEG), to lateralize the seizure onset zone
for subsequent surgical intervention⁵.


Graph-based models of networks are commonly applied to characterize interaction patterns in the brain; 
however, recent studies have used rigorous algebraic topology methods to analyze brain recordings to 
address several limitations of graph-based models*”!°. Topological data analysis (TDA) is a 
quantitative framework that can be used to characterize higher-dimensional interaction patterns by 
using robust, scale-invariant methods, such as persistent homology!'. Specifically, quantitative 
measures generated from persistent homology values, such as persistence landscapes, persistence 
images, and persistent entropy, have highlighted the promise of applying TDA methods to analyze EEG 
data with respect to seizure (ictal) activity>””!° and to distinguish seizure onset from preictal activity’. 
Moreover, TDA methods have been integrated with ML algorithms for several applications”, including
characterizing brain interaction dynamics.


The development and use of an integrated ML and TDA tool to characterize brain interaction dynamics 
is a resource-intensive endeavor that demands expertise in domains such as mathematics, neurology, 
and computing. Therefore, there is a high entry barrier for the wider neuroscience community to use 
TDA methods and ML algorithms together for research studies!*'*. To address this critical barrier, we 
introduce MaTiLDA as the first integrated web platform for TDA methods and ML algorithms to 
analyze neurophysiological recordings. We demonstrate the practical utility of MaTiLDA by 
characterizing brain interaction dynamics in refractory epilepsy patients using high resolution 
intracranial EEG (iEEG) recordings. 
2 Background


2.1 The Neuro-Integrative Connectivity platform 


Over the past decade we have developed an integrated neuroinformatics workflow tool called the
Neuro-Integrative Connectivity (NIC) platform to automate the multi-step methods used to characterize
brain interaction dynamics using signal data¹⁵⁻¹⁷. The NIC platform is a modular tool that supports
the flexible addition of new modules as support for new functionalities, including ML, is
added. One module transforms neurophysiological signal recordings stored in European Data Format
(EDF) into a JSON-based human-readable format with semantic annotations using an epilepsy domain
ontology that is more suitable for storage and analysis¹⁵. A second module computes signal coupling
measures using both frequency and amplitude features of the signal data¹⁶. A third module computes a
variety of graph model-based metrics¹⁷. A fourth module supports persistent homology functions using
open source libraries such as GUDHI¹⁸. We refer to our previous work for additional details of the NIC
tool¹⁵⁻¹⁷. MaTiLDA is an extension of the NIC tool that enables users to use TDA with ML algorithms for
integrated analysis of signal data.


2.2 Topological data analysis of EEG 


Brain functions are often characterized by interaction between multiple brain regions’; therefore, TDA
is well-suited to characterizing these interaction patterns with high dimensionality, which cannot be
easily represented using graph models'*. Persistent homology is a TDA method that has been
successfully used to identify brain states by analyzing multi-dimensional interactions across brain
regions>””"!*. Specifically, studies applying persistent homology to neurophysiological signal data
have shown the promise of TDA in characterizing aberrant brain interaction dynamics in neurological
disorders*”"'4. In this section, we briefly describe the terminology associated with TDA methods to
facilitate understanding of the subsequent sections of the paper.


Figure 1: A persistence diagram from our analysis (section 2.6). A persistence diagram is a
visualization of the results from persistent homology, where each point represents one homology class.


Persistent homology is a TDA method used to quantify the presence of topological structures, called
homology classes, across various thresholds, or filtration values. A homology class is a boundary
composed of simplices, each defined as the convex hull of a set of p+1 vertices. A simplex has
dimension p, and is referred to as a p-simplex, if it has a cardinality of p+1. Persistent homology
tracks the filtration at which each homology class is created (birth), the filtration at which it is
terminated (death), and the dimension of each homology class. These values can be visualized with a
persistence diagram (Figure 1), a plot representing birth along the x axis and death along the y axis.
The lifespan (death minus birth) of homology classes, as displayed in the persistence diagram, can be
analyzed across various periods of neurophysiological signal recording to identify changes in
topological structures and gain insights into the topology of brain networks. We refer interested
readers to Edelsbrunner and Harer for further descriptions of persistent homology.
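
As a minimal illustration of birth and death values, the sketch below computes dimension-0 persistent
homology of a Vietoris-Rips filtration with a union-find structure, using only the Python standard
library. It is not MaTiLDA's implementation (MaTiLDA relies on GUDHI); the function name
`h0_persistence` and the toy matrix are hypothetical, and the input is assumed to be a symmetric
matrix of distance-like values between electrodes. In dimension 0 every component is born at
filtration 0 and dies when it merges with another component.

```python
from itertools import combinations


def h0_persistence(dist):
    """Return (birth, death) pairs for dimension-0 classes of a Vietoris-Rips filtration.

    dist is a symmetric matrix (list of lists) of pairwise distance-like values.
    """
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the component representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    pairs = []
    # Process edges in order of increasing filtration value.
    edges = sorted((dist[i][j], i, j) for i, j in combinations(range(n), 2))
    for weight, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            pairs.append((0.0, weight))  # one component dies at this threshold
    pairs.append((0.0, float("inf")))  # the final component never dies
    return pairs


# Hypothetical 3-electrode example.
dist = [
    [0.0, 0.2, 0.9],
    [0.2, 0.0, 0.5],
    [0.9, 0.5, 0.0],
]
diagram = h0_persistence(dist)
lifespans = [death - birth for birth, death in diagram]
```

Each pair in `diagram` is one point of a persistence diagram, and `lifespans` holds the corresponding
death-minus-birth values.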
3. Methods


The computation and analysis of topological features from neurophysiological signal data entails 
multiple stages of processing, which include extraction of signal data, computation of signal coupling 
measures, TDA of signal coupling, data cleaning, and comparative analysis of topological features 
(Figure 2). Scientific workflow systems like the NIC platform have been used to automate these
multi-step processes. In this paper, we describe MaTiLDA as an extension of the NIC platform to implement
integrated support for TDA and ML algorithms for brain interaction studies. 


[Figure 2 panels: Brain Activity Recording (Electroencephalography, EEG); Signal Coupling (between
Electrodes); Topological Data Analysis (Persistent Homology, Vietoris-Rips Filtration); MaTiLDA
(Machine Learning of Topological Features)]

Figure 2: Our framework for computing and comparing topological features from neurophysiological
recordings. EEG from intracranial electrodes is used to extract signal data during epileptic seizures.
Signal coupling is calculated using the nonlinear regression coefficient developed by Pijn et al.
Persistent homology is applied to the signal coupling values using a Vietoris-Rips filtration as
implemented in GUDHI. MaTiLDA then allows users to select specialized data structures such as
persistence landscapes or persistence images to use as input for user-selected machine learning
classification such as SVM.


3.1 MaTiLDA architecture and development

The MaTiLDA platform was built using the Django web application framework, which uses the Python
programming language and features several libraries and modules that support a variety of data
processing and analysis tasks, including libraries for ML and TDA. MaTiLDA adopts the Model View
Template (MVT) approach, with user inputs managed by an object relational data component (Model), the
user interface handled by the View component, and user interaction mediated by the Template component.

[Figure 3 panels: Signal Data (EDF); Signal Data (CSF); Step 2: Signal Coupling; Step 3: Persistent
Homology; Step 4: Feature Engineering; Step 5: Machine Learning]

Figure 3: The MaTiLDA workflow leverages the NIC workflow to compute signal coupling. MaTiLDA applies
persistent homology to the coupling values and allows users to select data structures representing the
persistent homology values for input into machine learning classifications of their choice.


3.2 A framework for classifying brain states

MaTiLDA leverages modules from the NIC tool and maintains a modular analysis process (Figure 3). Before
analysis with MaTiLDA, neurophysiological recordings such as those from EEG are processed with the NIC tool
to convert from EDF to CSF and to compute signal coupling measures that can be used as input into
MaTiLDA for a desired ML classification task. Users are required to provide a set of folders, each
containing a set of coupling measure values (Figure 4). Users can subsequently apply MaTiLDA’s
persistent homology function, which uses a Vietoris-Rips filtration from the GUDHI library, to each
input. The persistent homology values are transformed into a specialized data structure as requested;
these data structures are used as input values for ML models selected by the user. An ML model is
trained using an 80% data partition. Labels are predicted for the remaining 20% data partition as a
test set. The test set accuracy score is reported alongside the precision, recall, and the area under
the receiver operating characteristic (ROC) curve. Accuracy is calculated as the number of correct
predictions out of total predictions. Precision is calculated as the number of true positive
predictions divided by the number of positive predictions. Recall, or true positive rate, is
calculated as the number of true positive predictions divided by the number of positive samples. The
ROC curve is a plot of the true positive rate along the y-axis against the false positive rate along
the x-axis for varying values of a threshold used to classify samples.


[Figure 4 screenshot: MaTiLDA data entry form. Fields: File Path (path to folders containing matrices
of signal coupling values); Class Labels (names of subfolders for class-specific data); Dimension (use
one or all dimensions up to and including the specified value); Persistent Homology Representations
and ML Classification Algorithms (select one or more featurization and machine learning methods);
Optional Hyperparameter Tuning (modify hyperparameters for any machine learning or featurization
method selected). Support Vector Machine options: Regularization Parameter (C) 1.0; Kernel (Linear,
Polynomial of degree 3, RBF, Sigmoid; RBF selected); Gamma (Scale, Auto, Float). Random Forest options:
Number of Trees 100; Criterion (Gini, Entropy, Log Loss; Gini selected); Max Depth of Trees (no max
depth if left blank); Minimum Number of Samples Required for Split.]

Figure 4: MaTiLDA supports various representations of persistent homology values in specialized data
structures and ML algorithms with optional hyperparameter inputs. Users provide a folder including
subfolders of outputs from the NIC correlator module, a list of all class labels (subfolder names),
and a dimension for analysis. Users may select multiple data structures and multiple machine learning
classification algorithms for their analysis using the checkboxes. For any selected representation or
machine learning algorithm, a set of hyperparameters will appear in the left of the screen. The user
may refine these parameters or use the preselected defaults. MaTiLDA will run each combination of
representation-algorithm pairs selected for analysis. In the example provided above, results are given
for every selected representation-algorithm pair.
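
The pairing rule described in the Figure 4 caption can be sketched with `itertools.product`. This is
only an illustration of the combinatorics, not MaTiLDA's code; the function name `analysis_plan` and
the string identifiers are hypothetical.

```python
from itertools import product

# Hypothetical identifiers for the options a user might tick in the form.
representations = ["lifespans", "persistent_entropy", "landscapes"]
algorithms = ["SVM", "RF", "LR"]


def analysis_plan(representations, algorithms):
    """Return one entry per representation-algorithm pair, in selection order."""
    return [
        {"representation": rep, "algorithm": alg}
        for rep, alg in product(representations, algorithms)
    ]


plan = analysis_plan(representations, algorithms)
```

Selecting three representations and three algorithms therefore yields nine analyses, one per pair.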
The area under the ROC curve (AUC) summarizes classification performance across all thresholds. A
separate ML model is run for each combination of selected data structures and ML algorithms. By
default, all ML models are implemented using default model parameters from scikit-learn and GUDHI;
however, users have the option to modify these parameters.
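
The evaluation metrics defined in this section can be written out in plain Python for a binary task.
The sketch below is illustrative only and is not taken from MaTiLDA or scikit-learn; the function
names are hypothetical, returning 0.0 when a denominator is zero is an assumed convention, and the AUC
is computed through its rank interpretation (the probability that a randomly chosen positive sample
receives a higher score than a randomly chosen negative sample, with ties counted as one half).

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for labels coded as 1 (positive) and 0 (negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumed convention for 0/0
    recall = tp / (tp + fn) if tp + fn else 0.0     # assumed convention for 0/0
    return accuracy, precision, recall


def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical test-set labels, hard predictions, and classifier scores.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
scores = [0.9, 0.4, 0.2, 0.6, 0.8, 0.1]

accuracy, precision, recall = binary_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```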

[Figure 5 screenshot: MaTiLDA results page (NIC, Topological Feature Learning) with Accuracy,
Precision, and Recall tables. The Accuracy table is reproduced here; the Precision table reports
per-class values for the classes ictal, ictal2, ictal3, and onset.]

Results - Accuracy

Representation           | ML Algorithm | Accuracy
Homology Class Lifespans | SVM          | 0.75
Persistent Entropy       | SVM          | 0.75
Landscapes               | SVM          | 0.75
Homology Class Lifespans | RF           | 0.75
Persistent Entropy       | RF           | 0.708
Landscapes               | RF           | 0.625
Homology Class Lifespans | LR           | 0.75
Persistent Entropy       | LR           | 0.75
Landscapes               | LR           | 0.75

Figure 5: Results for one seizure from a multiclass classification of ictal phases for patient one
using homology class lifespans, persistent entropy, persistence landscapes, or persistence images as
input to SVM, random forest, and logistic regression models.


3.3 MaTiLDA user interface 


The MaTiLDA user interface (Figure 4) consists of an intuitive data entry module and a minimal results 
table (Figure 5). MaTiLDA requires users to specify a directory containing several subdirectories, each 
of which should contain signal coupling values derived from neurophysiological signal data. MaTiLDA 
internally manages all data preprocessing, expecting signal coupling values to be in the format produced
by the NIC tools. A list of labels must be specified by the user; these labels will be matched to the 
subdirectory names to select and label signal coupling data from the main directory provided. Users 
must select a dimension for analysis; they may limit analysis to homology classes of that dimension, or 
they may analyze homology classes of dimension 0 through that dimension. Users may select several 
specialized data structures as representations for persistent homology values as well as several ML 
algorithms from a set of available options and may refine parameters for each selection using simple 
radio buttons and numeric input fields. Results are generated for all representation-algorithm pairs 
selected. The results table displays the representation chosen, the ML algorithm used, the model’s 
accuracy in testing data, the true positive rate, the false negative rate, and the AUC. 
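
The directory convention described above (one subdirectory per class label inside a main directory)
can be sketched with `pathlib`. The helper name `collect_labeled_files` is hypothetical and the sketch
makes no assumption about the NIC file format; it only shows how user-supplied labels could be matched
to subdirectory names.

```python
from pathlib import Path


def collect_labeled_files(root, labels):
    """Map each class label to the sorted files in the subdirectory of the same name."""
    root = Path(root)
    labeled = {}
    for label in labels:
        folder = root / label
        if not folder.is_dir():
            # A label without a matching subdirectory cannot be used for classification.
            raise FileNotFoundError(f"no subdirectory named {label!r} in {root}")
        labeled[label] = sorted(p for p in folder.iterdir() if p.is_file())
    return labeled


# Example call (paths are placeholders):
# files = collect_labeled_files("coupling_values", ["onset", "ictal1"])
```

Raising an error for a missing subdirectory is a design choice of this sketch; a tool could instead
skip the label and warn the user.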


3.4 Topological feature representation for machine learning 


A key challenge for applying persistent homology lies in the difficulty of statistical interpretation
of results. Persistent homology values lack geometric properties that would allow for the definition
of basic statistical concepts such as mean or median. While persistence diagrams are an intuitive
visualization method for representing the attributes of topological structures, the visual component
of persistence diagrams makes it challenging to use statistical methods to quantitatively analyze
them. Additionally, persistence diagrams are not vectors in a Hilbert or Banach space and thus a
unique mean cannot be established to define statistical measures. Moreover, persistent homology
values, and the persistence diagrams representing them, do not maintain a consistent number of
homology classes, which creates a challenge for conducting balanced comparisons. Consequently, a range
of quantitative methods have been devised to facilitate the integration of persistence diagrams and
persistent homology values into ML classifications. These methods for feature engineering can be used
to represent persistent homology values as specialized data structures that can be used as input to ML
models. We provide the necessary background for the five quantitative methods for persistent homology
value representation that have been implemented in the initial version of MaTiLDA: homology class
lifespans, persistence landscapes, persistence silhouettes, persistence images, and persistent entropy
(Figure 6). In this work, we show how MaTiLDA can be used to intuitively conduct analyses by using
these quantitative methods to represent persistent homology values derived from coupling measures
computed from neurophysiological recordings and using the resulting features as input into ML
algorithms.

[Figure 6 panels: Persistence Diagram; Persistence Landscape; Persistence Landscape Vector; Persistent
Entropy]

Figure 6: MaTiLDA offers several options for representing persistent homology values as vectors in
Euclidean space, including persistence landscapes, persistence silhouettes, persistence images,
persistent entropy, and homology class lifespans. Homology class lifespans create a list of values
from the lifespans of all homology classes in a persistence diagram. Persistence landscapes and
silhouettes transform persistence diagrams and apply a tent function before sampling uniformly across
the transformed axis to create a list of values. Persistence images convert a persistence diagram into
a two-dimensional image where each pixel represents a rectangular area of the diagram, and the
intensity of the image represents the frequency of occurrence of homology classes. Persistent entropy
is the Shannon entropy of the distribution of homology class lifespans (section 3.4.4).


3.4.1 Homology class lifespan 


We calculate the lifespan for each homology class resulting from persistent homology and store the 
values in a list. Lifespan lists are ordered based on the lifespan values such that the first value in the 
lifespan list is the longest lifespan within that list. The lifespan list has a length equivalent to the sum 
of the Betti numbers (the number of homology classes) from all dimensions included in analysis. We 
create the input features for ML algorithms using tensor data structures that are padded with zero values 
to account for varying length of the tensors corresponding to different homology class lifespan values. 
Our methods are similar to the work described in the study by Bendich et al.; however, unlike Bendich
et al., we do not limit the number of lifespan values included in a list.
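
The lifespan representation described above can be sketched as follows. This is an assumption-laden
illustration rather than MaTiLDA's code: the function names are hypothetical, plain Python lists stand
in for tensors, and classes with an infinite death value are skipped, which the paper does not specify.

```python
def lifespan_vector(diagram):
    """Lifespans (death - birth) sorted from longest to shortest; infinite classes are skipped."""
    spans = [death - birth for birth, death in diagram if death != float("inf")]
    return sorted(spans, reverse=True)


def pad_vectors(vectors):
    """Right-pad every vector with zeros so that all vectors share the longest length."""
    width = max(len(v) for v in vectors)
    return [v + [0.0] * (width - len(v)) for v in vectors]


# Two hypothetical epochs with different numbers of homology classes.
epoch_a = [(0.0, 0.2), (0.0, 0.5), (0.0, float("inf"))]
epoch_b = [(0.0, 0.3), (0.0, float("inf"))]

features = pad_vectors([lifespan_vector(epoch_a), lifespan_vector(epoch_b)])
```

Zero-padding gives every epoch a feature vector of the same length, which is what most ML classifiers
require as input.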


3.4.2 Persistence landscapes & silhouettes 


The persistence landscape is a sequence of piecewise-linear functions,
$\lambda_1, \lambda_2, \ldots : \mathbb{R} \to \mathbb{R}$, that map persistent homology values to a
vector space, where $\lambda_k$ refers to the $k$th persistence landscape function. The persistence
landscape can be calculated using Eq 1, where $t$ denotes the filtration value, kmax denotes the $k$th
largest element of a set, $I$ is the set of persistent homology values, and each homology class $i$ in
$I$ has a birth $b_i$ and a death $d_i$.


$$\lambda(k, t) = \operatorname{kmax}_{i \in I} \, \max\bigl(0, \min(t - b_i,\; d_i - t)\bigr) \qquad (1)$$


The persistence landscape is plotted with the filtration along the x axis and the persistence landscape
value $\lambda(k, t)$ along the y axis (Figure 6). A vector is created by uniformly sampling points
along the x-axis and calculating the maximum of the persistence landscape functions at that point. A
persistence silhouette is a variation of the persistence landscape in which a vector is created by
taking the weighted average of the functions, rather than the maximum. The advantages of persistence
landscapes and silhouettes are that they are invertible, parameter-free, and nonlinear, and that they
have desirable properties for statistical modeling, including a unique mean.
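
A small sketch of the tent-function construction may help. The code below evaluates the $k$th
persistence landscape and a persistence silhouette on a grid of filtration values using only the
standard library. It is not the GUDHI implementation used by MaTiLDA; the function names are
hypothetical, and weighting each tent by its lifespan raised to `power` is one common choice of
silhouette weight that is assumed here.

```python
def tent(birth, death, t):
    """Tent function of one homology class: rises from birth, peaks mid-life, falls to death."""
    return max(0.0, min(t - birth, death - t))


def landscape(diagram, k, grid):
    """k-th persistence landscape (k = 1 is the largest tent value) sampled on grid."""
    values = []
    for t in grid:
        heights = sorted((tent(b, d, t) for b, d in diagram), reverse=True)
        values.append(heights[k - 1] if k <= len(heights) else 0.0)
    return values


def silhouette(diagram, grid, power=1.0):
    """Weighted average of the tent functions, each weighted by lifespan ** power."""
    weights = [(d - b) ** power for b, d in diagram]
    total = sum(weights)
    return [
        sum(w * tent(b, d, t) for w, (b, d) in zip(weights, diagram)) / total
        for t in grid
    ]


# Hypothetical diagram with two finite homology classes.
diagram = [(0.0, 4.0), (1.0, 3.0)]
grid = [0.0, 1.0, 2.0, 3.0, 4.0]

first = landscape(diagram, 1, grid)
second = landscape(diagram, 2, grid)
sil = silhouette(diagram, grid)
```

Here `first` traces the outer envelope of the tents, `second` the next layer beneath it, and `sil`
blends all tents into a single curve.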


3.4.3 Persistence images 


To create a persistence image, a Gaussian function is applied to each homology class resulting from
persistent homology. The weighted sum of these Gaussian functions is discretized over a grid, and a
matrix of pixel values is created by integrating the sum over each grid box. Consequently, each pixel
value in the persistence image represents a rectangular area of the persistence diagram, and the
intensity of the image represents the frequency of occurrence of homology classes. Computing a
persistence image requires a distribution, a resolution, and a weighting function. The advantages of
persistence images are that they are stable, interpretable, and computationally efficient
representations in $\mathbb{R}^n$.


3.4.4 Persistent entropy 


Persistent entropy is a single value representing the Shannon entropy of a probability distribution
obtained from persistent homology. The persistent entropy of a set of persistent homology values can
be calculated using Eq (2), where $l_i$ is the lifespan of a topological structure and
$L = \sum_i l_i$ is the sum of all lifespans.


$$E = -\sum_i \frac{l_i}{L} \log\left(\frac{l_i}{L}\right) \qquad (2)$$
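
Persistent entropy is short enough to compute directly. The sketch below normalizes the lifespans into
a probability distribution and takes its Shannon entropy; the function name is hypothetical, and the
natural logarithm is an assumption, since the base is not stated in the text.

```python
import math


def persistent_entropy(lifespans):
    """Shannon entropy of the lifespans after normalizing them to sum to one."""
    total = sum(lifespans)
    probabilities = [span / total for span in lifespans if span > 0]
    return -sum(p * math.log(p) for p in probabilities)


uniform = persistent_entropy([1.0, 1.0])      # two equally persistent classes
dominated = persistent_entropy([9.0, 1.0])    # one class dominates
```

Equal lifespans give the maximum entropy for that number of classes, while a diagram dominated by one
long-lived class gives a value closer to zero.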


3.5 Machine Learning of Topological Features 


In the MaTiLDA pipeline (Figure 4), persistent homology is applied to signal coupling values derived 
from neurophysiological signal recordings. Based on user specification (section 3.3), feature
engineering is applied to the resulting persistent homology values to create specialized data structures 
(section 3.4) to be used as input features for ML models. Five common algorithms for ML classification 
were selected to be implemented in the initial version of MaTiLDA: support vector machines, random 
forest, gradient boosted trees, K-nearest neighbor, and logistic regression. In this section, we provide a 
brief introduction to each of these algorithms. 


3.5.1 Support vector machine 


Support vector machine (SVM) is a supervised learning algorithm that aims to find the best-separating
decision boundary between categories, using a kernel function to handle data that are not linearly
separable. While an SVM does not naturally distinguish between more than two classes, it can be
extended to multi-class classification problems using approaches such as one-versus-one and
one-versus-rest. For MaTiLDA, multi-class classifications using SVM are handled using the
one-versus-rest approach. In the one-versus-rest approach, for a classification of K classes, K binary
classifiers are fit, where each classifier compares one of the K classes to the remaining K-1 classes.


3.5.2 Random forest and gradient boosted trees 


Random forest (RF) is a form of decision tree bagging (generating several training sets by sampling
from the original training set with replacement) that focuses on making the ensemble of decision trees
more diverse. As in bagging, an ensemble of trees is built based on bootstrapped training samples. In
addition, rather than considering every attribute, a random sample of attributes is drawn at each
split point in the tree; of this sample, the attribute with the highest information gain is selected
as the split. A majority vote from the tree-specific predictions is used to classify each example.


Gradient boosted trees (GBT), like random forest, is a powerful learning algorithm that can learn
complex, non-linear relationships. GBT is a boosting algorithm using gradient descent. While bagging
builds trees on bootstrapped data independently of other trees, boosting uses a modified version of
the original dataset to sequentially grow trees such that each tree is grown using information from
previously grown trees.


3.5.3 K-nearest neighbor 


K-nearest neighbor (KNN) is a non-parametric, supervised learning classifier that facilitates
classification for observations by leveraging their proximity to the K nearest datapoints, or
neighbors, in the training data. The classification decision is made through a majority voting scheme
among the K nearest neighbors. KNN has a high computational cost due to performing distance
calculations for each observation.
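
The majority-vote rule can be illustrated with a brute-force KNN classifier written with the standard
library. This is a teaching sketch, not the scikit-learn classifier that MaTiLDA exposes; the function
name, the Euclidean distance, and the toy feature vectors are assumptions. The loop over all training
points also shows where the computational cost mentioned above comes from.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    # One distance computation per training point for every query.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-dimensional feature vectors (for example, padded lifespans).
train_x = [[0.5, 0.2], [0.6, 0.1], [0.1, 0.0], [0.2, 0.1], [0.55, 0.25]]
train_y = ["ictal1", "ictal1", "onset", "onset", "ictal1"]

near_onset = knn_predict(train_x, train_y, [0.15, 0.05], k=3)
near_ictal = knn_predict(train_x, train_y, [0.58, 0.2], k=3)
```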


3.5.4 Logistic regression 


Logistic regression (LR) models the probability that an observation belongs to a particular class. By
employing a logistic function, a linear combination of predictors is mapped to the range [0, 1],
allowing LR to estimate the probability of class membership, with coefficients fit by maximum
likelihood estimation.


3.6 Validation of MaTiLDA 


Epilepsy is the second most common neurological disorder and presents a unique opportunity for the
application of TDA to study aberrant brain interaction dynamics. Epilepsy is characterized by recurrent
seizures stemming from abnormal electrical discharges that spread throughout the brain and disrupt
normal functioning. The most significant changes to brain interactions during seizures occur during
the spread of aberrant activity to new brain regions (referred to as ictal phases such as ictal 1
phase, ictal 2 phase, etc.) and the termination of a seizure. One approach to understanding these
changes in brain interaction dynamics is the classification of these ictal phases. To validate the use
of the MaTiLDA interface for characterizing aberrant brain interaction dynamics using TDA and ML, we
apply the MaTiLDA pipeline to analyze neurophysiological signal data from a cohort of four refractory
epilepsy patients undergoing pre-surgical evaluation in the epilepsy monitoring unit (EMU) at
University Hospitals Cleveland Medical Center’s level 4 epilepsy facility, which regularly performs
epilepsy surgery. All patients were between the ages of 25 and 50 and had refractory epilepsy; 75% of
the patients were women. Table 1 shows the characteristics of these patients. Using MaTiLDA, we applied TDA and ML
to analyze iEEG recordings from two seizures from each of these patients to classify ictal phases 
including seizure onset and propagation to different brain regions. 


3.6.1 Study Data 


We selected iEEG recordings from two seizures each from four refractory epilepsy patients undergoing
pre-surgical evaluation. Intracranial electrodes were implanted based on a presurgical protocol
described in work by Wu et al. Retrospective visual analyses of EEG recordings were conducted using a
Nihon-Kohden Neurofax system (Nihon Kohden America, Foothill Ranch, CA, U.S.A.) with AC amplifiers, a
high sampling rate of 2,000 Hz, and an acquisition rate spanning 0.016-300 Hz. The EEG was filtered at
600 Hz with a 0.03 s time constant and sensitivity ranging from 30-100 µV based on optimal seizure
visibility for each implant. A 60 Hz notch filter was applied to all EEG recordings. Clinicians defined
seizure onset as the earliest distinctive occurrence of rhythmic sinusoidal activity or repetitive
spikes; the region of activity was noted as the seizure onset zone. Ictal phases were defined as the
subsequent spread of seizure activity to new brain regions. EEG sequences were broken down into
one-second epochs and features were computed for each epoch.
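
The epoching step can be sketched in a few lines. At the 2,000 Hz sampling rate quoted above, each
one-second epoch holds 2,000 samples per channel. The function name is hypothetical, and dropping a
trailing partial epoch is an assumption of this sketch rather than a documented NIC or MaTiLDA
behavior.

```python
def split_into_epochs(samples, sampling_rate_hz, epoch_seconds=1.0):
    """Cut one channel into consecutive, non-overlapping epochs; a trailing partial epoch is dropped."""
    size = int(sampling_rate_hz * epoch_seconds)
    n_epochs = len(samples) // size
    return [samples[i * size:(i + 1) * size] for i in range(n_epochs)]


# 2.5 seconds of placeholder samples at 2,000 Hz.
signal = list(range(5000))
epochs = split_into_epochs(signal, 2000)
```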


Table 1: Characteristics of two seizures from four randomly selected refractory epilepsy patients.

Patient | Age Range | Sex | Epileptogenic Zone | Medication
1 | 25-30 | F | Left Hemisphere | Trileptal, Keppra
2 | 45-50 | M | Bitemporal | Lamotrigine, Phenytoin, Valproic Acid
3 | 20-25 | F | Left Mesial Temporal | Trileptal, Vimpat
4 | 30-35 | F | Right Mesial Temporal | Keppra, Lacosamide

Patient | Seizure | Seizure Duration (s) | Active Electrodes | Ictal Phases | Seizure Semiology
1 | 1 | 48 | IM1, IM8-9, SM1-3, IL6-8, ML1-8, SP2-5, IP1-3, MP1-3, HH1-10 | 2 | Aura > mouth and hand automatisms > mild combativeness & amnesia
1 | 2 | 43 | IM1, IM8-9, SM1-3, IL6-8, ML1-8, SP2-5, IP1-3, MP1-3, HH1-10 | 2 | Aura
2 | 1 | 90 | TP1-8, AM1-8, HB1-2, RA1-2, RH1-8, HH1-8 | 2 | Aura > postictal aphasia
2 | 2 | 120 | TP1-8, AM1-4, HB1-2 | 2 | Aura > postictal aphasia
3 | 1 | 120 | HH1-3, HB1-3, AM1-3, MI1-12, PI1-12, IA1-12, IM1-12, SA1-12, MA1-12 | 4 | Abdominal aura
3 | 2 | 120 | HH1-3, HB1-3, AM1-3, MI1-12, PI1-12, IA1-12, IM1-12, SA1-12, MA1-12 | 4 | Abdominal & gustatory aura
4 | 1 | 60 | HH2-3, EM8-9, HH1-12, HB1-12, TT1-12, OF1-12 | 4 | After stimulating AM3 with 50Hz, 4.6mA, 3s, patient felt "oozy"
4 | 2 | 60 | AM1-2, EM9-10, HH1-12, HB1-12, TT1-12, OF1-12 | 4 | After stimulating AM4 with 5Hz. 7mA, 3 seconds, patient felt funny

3.6.2 Study Design


All seizure data was preprocessed using the NIC tools. For each seizure, we used MaTiLDA to apply
persistent homology to signal coupling values from one-second epochs of iEEG data and to create data
structures representing the resulting persistent homology values that were used as input into ML
models to classify epochs as belonging to an ictal phase. Of the eight seizures selected, four
seizures were analyzed in binary classification tasks to classify seizure onset from ictal 1 phase,
and the remaining four seizures were analyzed in multiclass classification tasks to classify ictal
phases (seizure onset, ictal 1 phase, ictal 2 phase, ictal 3 phase, and ictal 4 phase). Each seizure
was analyzed separately. The number of one-second epochs in each ictal phase of each seizure,
equivalent to the sample sizes of each class label in each seizure-specific analysis, is provided in
Table 2. Default parameters were used for all representations of persistent homology values and for
all ML algorithms in the analysis of each of the eight seizures to show the baseline capabilities of
MaTiLDA.

Table 2: The sample size of each class is equal to the duration of the associated ictal phase.

Patient | Seizure | Onset (s) | Ictal 1 (s) | Ictal 2 (s) | Ictal 3 (s)
1 | 1 | 15 | 33 | - | -
1 | 2 | 15 | 28 | - | -
2 | 1 | 10 | 80 | - | -
2 | 2 | 5 | 115 | - | -
3 | 1 | 10 | 15 | 5 | 90
3 | 2 | 10 | 15 | 5 | 90
4 | 1 | 10 | 15 | 5 | 30
4 | 2 | [illegible] | [illegible] | [illegible] | [illegible]


4. Results 


To validate the use of the MaTiLDA interface, we aimed to classify ictal phases within a seizure for
eight seizures from four refractory epilepsy patients, as described in section 3.6. For brevity, we
present only the results from the analysis of persistent homology values in dimension 0.


Binary classifications were used to compare seizure onset and ictal 1 phase for the four seizures from
patient one and patient two, as these seizures were limited to these two ictal phases. Due to space
constraints, we review only the results for RF, SVM, and LR models using either the lifespan or
persistence landscape methods. ROC curves can be seen for each of these models for all four seizures
in Figure 7. Model performance varied across all seizures, and no ML algorithm or representation of
persistent homology values outperformed others to consistently distinguish seizure onset and ictal 1
phase (Figure 8). This may be due to imbalanced class sizes (Table 2). For example, the 20% test
partition of patient two’s second seizure contained only one epoch from seizure onset, and only four
epochs from seizure onset were included in the 80% training partition. For all combinations of ML
algorithms and representations of persistent homology values, this one epoch was misclassified as
belonging to ictal 1 phase, resulting in precision and recall values of 0 and an AUC of 0.50 but an
accuracy of 0.96. Increasing the number of samples from seizure onset may improve the ML models (as
seen for patient two’s first seizure). MaTiLDA’s implementation of data augmentation, however, is
still under development.

[Figure 7 panels: Patient 1 Seizure 1; Patient 1 Seizure 2; Patient 2 Seizure 1; Patient 2 Seizure 2.
Axes: False Positive Rate (x), True Positive Rate (y). Curves: Lifespans & SVM, Landscapes & SVM,
Lifespans & RF, Landscapes & RF, Lifespans & LR, Landscapes & LR.]

Figure 7: ROC curves for each seizure from the binary classifications for seizures from patients one
and two using lifespans or persistence landscapes in SVM, RF, or LR.

[Figure 8 labels: Patient; Seizure; Landscape; Lifespan]

Figure 8: MaTiLDA’s model performance for RF, SVM, and LR using lifespans or persistence landscapes
for the four seizures from patients one and two.
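
The imbalance effect in this example can be checked with a short calculation based on the class sizes
in Table 2 (5 onset epochs and 115 ictal 1 epochs for patient two’s second seizure). The sketch below
assumes, as described in the text, that the test partition holds one onset epoch and that every test
epoch is predicted as ictal 1; treating an undefined precision (no positive predictions) as 0 is an
assumed convention.

```python
onset_epochs, ictal1_epochs = 5, 115            # Table 2, patient two, seizure two
total_epochs = onset_epochs + ictal1_epochs      # 120 one-second epochs
test_size = total_epochs // 5                    # 20% test partition

# 1 = seizure onset (positive class), 0 = ictal 1 phase.
y_true = [1] + [0] * (test_size - 1)             # a single onset epoch in the test set
y_pred = [0] * test_size                         # every epoch predicted as ictal 1

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

accuracy = sum(1 for t, p in pairs if t == p) / test_size
precision = tp / (tp + fp) if tp + fp else 0.0   # no positive predictions: reported as 0
recall = tp / (tp + fn) if tp + fn else 0.0
```

With 23 of 24 test epochs labeled correctly, accuracy rounds to 0.96 even though the onset class is
never detected, which is why precision and recall are the more informative metrics here.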


Multiclass classifications were used to classify seizure phases for each of the remaining four seizures 
from patients three and four which included multiple ictal phases (seizure onset, ictal 1 phase, ictal 2 
phase, ictal 3 phase, and ictal 4 phase). Due to space constraints, we limit our results to the RF models 
using the lifespans and persistence landscapes (Figure 9). No algorithm or representation of persistent 
homology values consistently outperformed others to classify ictal phases, and there was high variation 
in model performance within and across seizures (Figure 9). 


[Figure 9 panels: Patient 3 (Seizure 1, Seizure 2); Patient 4 (Seizure 1, Seizure 2). Bars grouped by
representation (Lifespan shown); legend entries include Seizure Onset, Ictal Period 1, and Ictal
Period 2.]

Figure 9: MaTiLDA’s One-vs-Rest AUC values for RF classification of ictal phases using lifespans or
persistence landscapes for each of the four seizures from patients three and four show high variation
in model performance within and across seizures.


5. Discussion & Conclusion 


The results of this evaluation demonstrate that MaTiLDA is an effective tool for analyzing complex 
topological features, enabling the detection of changes in brain interactions during seizures. We have 
developed a novel pipeline that can classify brain states, such as the ictal phases of several seizures in 
this study, using various common TDA methods and ML algorithms. The MaTiLDA platform provides 
a robust, accessible, and reliable framework for applying TDA and ML algorithms to datasets from 
neurophysiological recordings to characterize brain interaction dynamics in neurological disorders. 
MaTiLDA enables the wider neuroscience research community, including researchers with limited
experience in implementing TDA and ML algorithms, to use these methods to analyze the increasingly
large volumes of brain activity recordings and characterize brain interaction dynamics. We believe that
the MaTiLDA tool can be used in future research to investigate complex brain interaction patterns in
neurological disorders such as epilepsy, and to allow clinicians and researchers to characterize
neurological disorders, understand pathophysiological mechanisms, and identify biomarkers for clinical
diagnoses.
In the intricate landscape of healthcare analytics, effective feature selection is a prerequisite
for generating robust predictive models, especially given the common challenges of sample sizes
and potential biases. Zoish uniquely addresses these issues by employing Shapley additive values,
an idea rooted in cooperative game theory, to enable both transparent and automated feature
selection. Unlike existing tools, Zoish is versatile, designed to seamlessly integrate with an
array of machine learning libraries including scikit-learn, XGBoost, CatBoost, and imbalanced-learn.

The distinct advantage of Zoish lies in its dual algorithmic approach for calculating Shapley
values, allowing it to efficiently manage both large and small datasets. This adaptability renders
it exceptionally suitable for a wide spectrum of healthcare-related tasks. The tool also places a
strong emphasis on interpretability, providing comprehensive visualizations for analyzed features.
Its customizable settings offer users fine-grained control over feature selection, thus optimizing
for specific predictive objectives.

This manuscript elucidates the mathematical framework underpinning Zoish and how it uniquely
combines local and global feature selection into a single, streamlined process. To validate Zoish’s
efficiency and adaptability, we present case studies in breast cancer prediction and Montreal
Cognitive Assessment (MoCA) prediction in Parkinson’s disease, along with evaluations on 300
synthetic datasets. These applications underscore Zoish’s unparalleled performance in diverse
healthcare contexts and against its counterparts.


Keywords: Feature Selectors, Zoish, SHapley Additive exPlanations 
1. Introduction


Healthcare datasets, despite being typically sparse and heterogeneous, are a treasure trove of
rich information. However, their high dimensionality, often paired with small sample sizes,
presents obstacles to building predictive models, with overfitting and extensive training time
being common concerns. Feature selection becomes an essential strategy in this context, aimed
at pruning redundant or less important features. This helps to minimize information loss,
enhance model interpretability, and curtail computational demands.

Although traditional feature selection methods, grounded in statistical concepts like correlation
analysis or chi-square tests, are widely used, they tend to fall short in offering detailed
insights into feature importance. This shortcoming, along with the manual effort required, can
lead to a time-intensive cycle of feature selection and performance evaluation that requires
expert intervention.

Our proposed feature selection tool, Zoish, aims to overcome these limitations by utilizing
the mathematical framework of additive Shapley values. Originating from cooperative game
theory, Shapley values offer detailed understanding of feature importance, thereby enhancing
both local (instance-level) and global interpretability. Moreover, the integration of Zoish with
our scalable hyperparameter optimization package, Lohrasb, facilitates building models with
optimal feature sets, all the while maintaining an industry-ready, user-friendly design.

The structure of the paper is as follows. The first section sheds light on the core concepts of
additive Shapley values and delves into the mathematical principles vital to Zoish. A user guide
for Zoish follows. Lastly, we demonstrate the adaptability of Zoish through a variety of
use-cases, experiments on large synthetic datasets, and a closing discussion.


2. Theoretical Foundations of Zoish 


Our exploration into Zoish begins with the foundation of its theoretical structure, built upon 
Shapley additive values. First proposed in the field of cooperative game theory, Shapley addi- 
tive values have proven to be a potent tool for understanding the contribution of each feature 
to a prediction made by a machine learning model. Simply put, the Shapley value of a fea- 
ture represents the average marginal contribution of that feature, factored across all possible 
combinations of features. 


2.1. Shapley Additive Values and Feature Selection: A Game Theoretical 
Approach 


In the realm of cooperative game theory, the Shapley value denotes each player’s payoff based 
on their marginal contribution across all possible coalitions. In machine learning, 'players' 
correspond to the features and the 'game' to the prediction task.® 
An additive cooperative game assumes that the value of any coalition equals the sum of 
its members’ independent values. This idea leads us to Shapley additive values, where the 
Shapley value of a feature equals its average marginal contribution across all feature subsets. 
Mathematically, the Shapley value for a feature i in an additive game is: 

$$ s_v(i) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N| - |S| - 1)!}{|N|!} \left[ v(S \cup \{i\}) - v(S) \right] $$

In this formula, |N| denotes the total number of features, S a subset of features excluding 
feature i, and |S| the number of features in subset S. The terms |S|!(|N| − |S| − 1)! and |N|! 
calculate the number of possible permutations of features. The expression v(S ∪ {i}) − v(S) 
computes the marginal contribution of feature i when added to subset S. 
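
As an illustration of this definition, the following minimal sketch enumerates every subset S for each feature and applies the weight |S|!(|N| − |S| − 1)!/|N|! to the marginal contribution. The function name `shapley_values` and the toy additive game are assumptions made for the example; they are not part of Zoish or the SHAP library, and the exhaustive enumeration is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values: weighted marginal contributions over all subsets."""
    n = len(players)
    result = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for size in range(n):
            # Weight |S|! (|N| - |S| - 1)! / |N|! for subsets of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {i}) - value(s))
        result[i] = total
    return result


# Hypothetical additive game: a coalition is worth the sum of its members.
worth = {"a": 1.0, "b": 2.0, "c": 3.0}


def additive_value(coalition):
    return sum(worth[p] for p in coalition)


phi = shapley_values(list(worth), additive_value)
print(phi)
```

In an additive game every marginal contribution of a player equals that player's own worth, so each Shapley value should coincide with it up to floating-point error.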

In feature selection, the Shapley value presents a way to distribute the model’s prediction 
among the features, based on their marginal contribution.’ Shapley Additive exPlanations 
(SHAP) values, maintaining additivity, provide a unified measure of feature importance, at- 
tributing the difference between the model’s actual and expected output to each influencing 
feature.'° High Shapley or SHAP values indicate significant feature importance, while values 
near zero suggest negligible predictive power.!! This correlation aids in reducing data 
dimensionality and enhances model interpretability, marking a significant stride in feature selection. 


2.2. Properties of Shapley Additive Values 


The Shapley additive values satisfy a number of properties that make them particularly useful 
for interpreting machine learning models: 


• Efficiency: The sum of the Shapley values of all features is equal to the difference 
between the prediction for an instance and the average prediction over all instances. 

• Symmetry: If two features contribute equally to all possible combinations of features, 
they have the same Shapley value. 

• Additivity: Given two games (or in our context, two models), the Shapley value of 
the combined game is the sum of the Shapley values of the individual games. 

• Nullity (Dummy): If a feature does not improve the prediction for any combination 
of features, its Shapley value is zero. 
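
The four properties listed above can be checked numerically on a small example. The sketch below uses the equivalent ordering-based view of the Shapley value (the average marginal contribution over all orderings of the players). The games `game` and `bonus_game` and the feature names are hypothetical, and efficiency is expressed in game terms as v(N) − v(∅), which plays the role of "prediction minus average prediction".

```python
from itertools import permutations


def shapley_by_orderings(players, value):
    """Average each player's marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            totals[p] += value(coalition | {p}) - value(coalition)
            coalition = coalition | {p}
    return {p: t / len(orderings) for p, t in totals.items()}


# Hypothetical game: x1 and x2 are interchangeable, x3 never changes the payoff.
def game(coalition):
    useful = len(coalition & {"x1", "x2"})
    return {0: 0.0, 1: 4.0, 2: 6.0}[useful]


# Second hypothetical game, used only to illustrate additivity.
def bonus_game(coalition):
    return 1.0 if "x3" in coalition else 0.0


def combined(coalition):
    return game(coalition) + bonus_game(coalition)


players = ["x1", "x2", "x3"]
phi = shapley_by_orderings(players, game)
phi_bonus = shapley_by_orderings(players, bonus_game)
phi_combined = shapley_by_orderings(players, combined)

# Efficiency: the values sum to v(N) - v(empty set).
efficiency_gap = sum(phi.values()) - (game(frozenset(players)) - game(frozenset()))
# Symmetry: interchangeable players receive the same value.
symmetry_gap = phi["x1"] - phi["x2"]
# Nullity: the dummy player receives zero.
dummy_value = phi["x3"]
# Additivity: values of the combined game are the sums of the individual values.
additivity_gap = max(abs(phi_combined[p] - phi[p] - phi_bonus[p]) for p in players)

print(phi, efficiency_gap, symmetry_gap, dummy_value, additivity_gap)
```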


2.2.1. Proof of Nullity (Dummy) 


The Nullity (Dummy) property states that if a feature does not change the prediction model, 
i.e., its contribution is always zero, then its Shapley value is also zero. Let f be the prediction 
model and d be such a dummy feature. 

According to the definition of Shapley values, the Shapley value of a feature is the average 
of its marginal contributions across all possible subsets of features. Therefore, the Shapley 
value s_f(d) for the dummy feature d is: 

$$ s_f(d) = \frac{1}{M} \sum_{S \subseteq N \setminus \{d\}} |S|!\,(|N| - |S| - 1)! \left( f(S \cup \{d\}) - f(S) \right) $$

In the above formula, M = |N|! is the total number of orderings of the features in N. The 
term |S|!(|N| − |S| − 1)! is the number of those orderings in which the features in subset S 
come before the dummy feature d and all remaining features come after it, and 
f(S ∪ {d}) and f(S) are the values of the game f when the dummy feature d is added to subset 
S and when it is not, respectively. 

Since d is a dummy feature, adding it to any subset S does not change the value of the 
game f. Thus, we have f(S ∪ {d}) = f(S) for all S ⊆ N \ {d}, which simplifies the above formula 
to: 

$$ s_f(d) = \frac{1}{M} \sum_{S \subseteq N \setminus \{d\}} |S|!\,(|N| - |S| - 1)! \,(0) $$

$$ s_f(d) = 0 $$


This confirms that the Shapley value of a feature that does not contribute to the prediction 
model is indeed zero, thus proving the Nullity (Dummy) property.’ 


2.3. Leveraging Shapley Additive Values for Efficient Feature Selection 


Shapley additive values have become increasingly prominent in feature selection for machine 
learning due to their robustness, efficiency, and power. Verdinelli et al. examined the explain- 
ability of machine learning models, focusing on methods such as LOCO and Shapley Values 
for assessing feature importance. Although their research indicated that Shapley Values do 
not eliminate feature correlation, they proposed new, statistically sound axioms for measuring 
feature importance.'? In a separate study, Karczmarz et al. compared Shapley and Banzhaf 
values in the context of explaining tree ensemble models. They found Banzhaf values to be 
more intuitive, efficient, and numerically robust, and introduced faster algorithms for both 
methods to improve computational efficiency.!? The SHAP (SHapley Additive exPlanation) 
library serves as a prime example of effectively leveraging the additivity and efficiency inher- 
ent in Shapley values.'* A fundamental advantage of Shapley values lies in their additivity, 
which enables fast and efficient computation, especially in the context of tree-based models. 
The SHAP and FastTreeShap libraries employ Tree SHAP, a highly efficient and accurate 
algorithm designed for tree ensembles.!*! Given the innate additivity of ensemble tree mod- 
els, which amalgamate multiple decision trees, this characteristic ensures swift and precise 
computation of Shapley values. The efficacy of Shapley values is further underscored by their 
intrinsic efficiency. This is manifested in the fact that the sum of the Shapley values for all 
features equals the difference between the prediction for a specific instance and the average 
prediction across all instances. This aspect permits a meaningful distribution of the "credit" 
for a prediction across features, hence illuminating their relative importance. The Zoish pack- 
age,!® designed to optimize feature selection, taps into the beneficial properties of Shapley 
values. It employs the Nullity (or Dummy) property to eliminate features with Shapley values 
close to or exactly zero, indicating their minimal predictive relevance. To assist users in setting 
the cut-off level, two methods are offered: one involves setting an internal parameter called 
threshold, while the other entails defining the number of desired features to retain in the model. 
By removing these non-influential features, the package enables dimensionality reduction of 
the model without sacrificing prediction quality. The symmetry property of Shapley values 
is also exploited by Zoish. This property mandates that if two features contribute equally to 
all possible subsets of other features, they must have identical Shapley values. By identifying 
and discarding these redundant features, Zoish facilitates the construction of models that are 
simpler and more interpretable, with no compromise on predictive power. By utilizing the 
SHAP and FastTreeShap libraries to incorporate Shapley additive values, Zoish implicitly 
benefits from their advantages, including the mathematical robustness and beneficial 
properties of Shapley values. Therefore, these libraries and the Zoish package present themselves as 
potent instruments for feature selection, spanning a wide range of machine learning tasks. 
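
The two cut-off options described above (a threshold, or a desired number of features) can be sketched as follows. This is an illustration only: it assumes that global importance is summarized as the mean absolute Shapley value per feature, which is a common convention but not necessarily how Zoish aggregates values internally. The helper names and the numbers are hypothetical.

```python
def mean_abs_importance(shap_rows):
    """Global importance per feature: mean absolute Shapley value over instances."""
    n_rows = len(shap_rows)
    n_features = len(shap_rows[0])
    return [sum(abs(row[j]) for row in shap_rows) / n_rows for j in range(n_features)]


def select_features(names, importance, threshold=None, num_features=None):
    """Keep features above a threshold, or the num_features highest-ranked ones."""
    ranked = sorted(zip(names, importance), key=lambda pair: pair[1], reverse=True)
    if num_features is not None:
        return [name for name, _ in ranked[:num_features]]
    if threshold is None:
        raise ValueError("set either threshold or num_features")
    return [name for name, score in ranked if score > threshold]


# Made-up per-instance Shapley values for three features (rows are instances).
names = ["radius", "texture", "noise"]
shap_rows = [
    [0.40, -0.10, 0.00],
    [-0.20, 0.30, 0.00],
    [0.60, -0.20, 0.01],
]

importance = mean_abs_importance(shap_rows)
by_threshold = select_features(names, importance, threshold=0.05)
top_one = select_features(names, importance, num_features=1)
print(by_threshold, top_one)
```

Here the near-zero feature is dropped under the threshold rule, in the spirit of the Nullity property, while the top-k rule keeps a fixed number of features regardless of their magnitude.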


3. Feature Selection Approaches 


Zoish is a versatile package designed to enhance the evaluation of feature importance and 
improve the overall performance of machine learning models.'® While Zoish can function ef- 
fectively as a standalone tool for feature selection, it is engineered to be highly extensible and 
can seamlessly integrate with hyperparameter optimization packages to further refine its ca- 
pabilities. One such potent integration is with the Lohrasb package,’ which provides advanced 
tuning methods to optimize the feature selection process. However, it’s worth noting that users 
are not confined to using Lohrasb; Zoish’s flexible architecture allows for easy integration with 
other hyperparameter optimization tools as well. 


3.1. Optimization and Flexibility in Zoish 


Zoish’s integration with Lohrasb serves a dual purpose: it not only optimizes the tree-based 
estimator used for feature selection but also offers a choice of hyperparameter tuning meth- 
ods, including Optuna, GridSearchCV,!” RandomizedSearchCV,!* OptunaSearchCV, tune- 
sklearn,!® and Ray’s Tune.” This optimization is crucial for enhancing Zoish’s feature selec- 
tion capabilities, as represented in Fig 1. However, the use of Lohrasb is optional, giving users 
the freedom to employ other tree-based estimators or hyperparameter tuning engines. Even 
without hyperparameter optimization, Zoish maintains its core functionality, allowing for a 
balance between efficiency and interpretability. The importance of hyperparameter optimiza- 
tion for feature selection is further elaborated in Section 6. Therefore, while Lohrasb’s role is 
significant for optimal performance, users have the flexibility to choose an approach that best 
suits their specific needs. 


3.2. Workflow explanation 


Within a machine learning pipeline, Zoish functions as a feature selection component. The 
pipeline commences by cleaning and splitting the original dataset into training and validation 
subsets. A tree-based estimator, which is compatible with Zoish, is trained on the training 
subset. If hyperparameter tuning is applied, tools such as Lohrasb optimize the estimator 
against a specific metric, as shown in Fig. 1. 

Once the estimator is optimized, it becomes an input to Zoish along with a set of pa- 
rameters, such as cross-validation settings, Shapley value calculation algorithms, and feature 
importance thresholds. Zoish computes Shapley values via either the SHAP library for smaller 
datasets, due to its exhaustive computational approach, or FastTreeShap for larger datasets, 
owing to its computational efficiency. 

Based on the calculated Shapley values, Zoish automatically selects the highest-ranking 
features. The training set is then narrowed down to these selected features. Subsequently, 
these refined training and validation sets are channeled to the next steps in the pipeline, 
which usually involve fitting another predictive model. 

To ensure robustness in feature selection, Zoish employs multiple rounds of cross-validation 
on the same training set, regulated by a parameter named n_iter. 
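
The idea of repeating cross-validation for several rounds and aggregating feature scores can be sketched in plain Python. This is not Zoish's implementation: the helper names are hypothetical, and the per-fold score (absolute covariance with the target) merely stands in for "fit the estimator on the fold and compute Shapley-based importances".

```python
import random


def kfold_indices(n_samples, n_splits, rng):
    """Shuffle row indices and yield (train, validation) index lists per fold."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    folds = [indices[k::n_splits] for k in range(n_splits)]
    for k in range(n_splits):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, folds[k]


def repeated_cv_importance(n_samples, n_features, score_fn, n_splits=5, n_iter=3, seed=0):
    """Average per-feature scores over n_iter rounds of n_splits-fold CV."""
    rng = random.Random(seed)
    totals = [0.0] * n_features
    runs = 0
    for _ in range(n_iter):
        for train, _valid in kfold_indices(n_samples, n_splits, rng):
            scores = score_fn(train)
            totals = [t + s for t, s in zip(totals, scores)]
            runs += 1
    return [t / runs for t in totals]


# Toy data: the target depends on feature 0 only; feature 1 is a short cycle.
X = [[float(i), float((7 * i) % 5)] for i in range(20)]
y = [2.0 * row[0] for row in X]


def abs_covariance_scores(train_rows):
    """Stand-in scorer: absolute covariance of each feature with the target."""
    n = len(train_rows)
    y_mean = sum(y[i] for i in train_rows) / n
    scores = []
    for j in range(len(X[0])):
        x_mean = sum(X[i][j] for i in train_rows) / n
        cov = sum((X[i][j] - x_mean) * (y[i] - y_mean) for i in train_rows) / n
        scores.append(abs(cov))
    return scores


importance = repeated_cv_importance(len(X), 2, abs_covariance_scores, n_splits=5, n_iter=3)
print(importance)
```

Averaging over several reshuffled rounds reduces the dependence of the ranking on any single train/validation split, which is the motivation for a parameter such as n_iter.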


Fig. 1. Zoish workflow 

Documentation and code examples elucidating these operational details can be found in 
the Zoish repository. 


4. Source Code, installation and usage example 


The public repository of Zoish, together with examples for end users, is available on GitHub at 
https://github.com/TorkamaniLab/zoish. The Zoish package is available on PyPI and can be 
installed with pip: 


pip install zoish 


A straightforward example demonstrates how Zoish can be effectively combined with hy- 
perparameter optimizers. Both this example and the comprehensive documentation in the 
repository highlight the package’s flexibility and adaptability across various scenarios. 


import xgboost as xgb
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from zoish.feature_selectors.shap_selectors import ShapFeatureSelector

X_train, X_test, y_train, y_test = ...  # Your dataset here

# Tune the tree-based estimator that Zoish uses to compute Shapley values.
param_grid = {"n_estimators": [100, 150], "max_depth": [6, 10], "gamma": [0.5, 1.0]}
grid = GridSearchCV(
    xgb.XGBClassifier(), param_grid, cv=5, n_jobs=-1, scoring="accuracy"
).fit(X_train, y_train)

# Keep 15 features, using 5 rounds of 10-fold cross-validation.
shap_selector = ShapFeatureSelector(
    grid.best_estimator_, num_features=15, cv=KFold(10), n_iter=5,
    direction="maximum", scoring="accuracy", algorithm="auto",
    use_faster_algorithm=True,
)

# Chain the selector with a downstream model.
pipeline = Pipeline(
    steps=[("s", shap_selector), ("m", LogisticRegression())]
).fit(X_train, y_train)


5. Use cases and applications 


5.1. Use case 1: Application to UCI breast cancer dataset - comparison 
with related XAI work 


To demonstrate the value of the Zoish feature selector in a real use case from the biomedical 
domain, we applied it to the openly available breast cancer dataset from the UCI Archive,?! 
and compared the results with a recent study evaluating different feature importance measures 
for the same dataset.?? 

The UCI dataset includes benign and malignant samples from 569 patients, 212 with cancer 
and 357 with fibrocystic breast masses. Each sample includes thirty features: ten real-valued 
features for each cell nucleus (radius, texture, perimeter, area, smoothness, compactness, con- 
cavity, concave points, symmetry, fractal dimension) each reported as Mean, Standard Error 
(SE) and Worst.?? As the classes in this dataset are almost linearly separable, classification 
per se is not a difficult task; however, the most important features generally differ depending 
on the technique used.”4 To further investigate this aspect, Saarela et al.?? compared different 
feature importance measures using both linear (logistic regression) and non-linear (random 
forest) models and local interpretable model-agnostic explanations for the same dataset. In 
Fig. 2 we show the top 20 important features for the UCI Breast Cancer dataset computed 
with Zoish by training a XGboost classifier over ten folds of cross validation. The AUC for 
the trained classifier was 0.96, similar to the mean AUC reported in?? (0.99+-002). Overall, 
the most important features in Zoish/XGboost agree well with the set of nine statistically 
significant features for both RF and LR reported in?” — where, for each method, significance 
was computed through a procedure based on permutation tests — i.e. by shuffling class labels 
in the training data over hundreds of runs. Seven out of nine features deemed statistically 
significant in?? were found in Fig. 2 (Mean concave points, Worst concave points, Worst Area, 
Worst Radius, Worst Perimeter, Mean Concavity, Mean Area), with five of the most 
significant features near the top of the list (Table 1). Only one feature was labeled as not significant 
by both RF and LR (Worst Compactness), and this feature consistently has a zero SHAP value 
in Zoish/XGboost (not shown). 

Fig. 2. SHAP summary plot of the top 20 important features for the UCI Breast Cancer dataset, 
computed with Zoish by training an XGboost classifier. 

It is worth noting that certain features which rank very high in Zoish/XGboost (Worst 
concavity, Worst texture, Compactness SE) appeared not to be significant in RF classification 
and hence were not reported in the set of nine common, statistically significant features for breast 
cancer classification in the UCI dataset. Why, then, did Zoish/XGboost select them? A 
very interesting hint comes from the analysis of local importance measures for a specific set of 
observations in the UCI dataset. Again in,?? LIME (local interpretable model-agnostic expla- 
nations,”°) was used to estimate local importances for the four most interesting observations, 
(i.e. correctly classified as benign with highest probability, correctly classified as malignant 
with highest probability, misclassified as benign with highest probability, and misclassified as 
malignant with highest probability). Strikingly, the features ranking high in Zoish/XGboost 
but absent in RF were also important features in LIME, especially for the observations mis- 
classified as benign (false negatives), which are critical for medical purposes (Table 1). The 
final recommendation in?? was to combine several explanation techniques in order to provide 
more reliable and trustworthy results, but this advice can often be impractical. Conversely, 
the Zoish/XGboost feature selector appears to select relevant features both at the global 
and local level, adding more detailed explanations of feature importance (i.e., not just the 
magnitude but also the direction of change), while being fast and straightforward to use. 

Table 1. Comparison of top features in the UCI Breast Cancer dataset 

Zoish features      | Significant features in LR and RF | LIME - Correctly classified benign (RF) | LIME - Misclassified benign (RF)
--------------------+-----------------------------------+-----------------------------------------+---------------------------------
concave points 1    | concave points 3                  | area 3                                  | perimeter 3
concavity 3         | area 3                            | perimeter 3                             | area 3
concave points 3    | concave points 1                  | radius 3                                | radius 3
area 3              | area 1                            | concave points 3                        | texture 3
texture 3           | perimeter 3                       | texture 3                               | concavity 3
compactness 2       | radius 3                          | concave points 1                        | area 2
radius 3            | concavity 1                       | concavity 3                             | smoothness 3
area 2              | perimeter 1                       | area 2                                  | area 1
perimeter 3         | radius 1                          | texture 1                               | concave points 1
symmetry 3          | area 2                            |                                         |
concave points 2    | concavity 3                       |                                         |
texture 1           | texture 3                         |                                         |
symmetry 2          | texture 1                         |                                         |
smoothness 3        | compactness 3                     |                                         |
smoothness 1        | radius 2                          |                                         |
concavity 1         | perimeter 2                       |                                         |
radius 2            | compactness 1                     |                                         |
fractal dimension 1 | smoothness 3                      |                                         |
compactness 1       | symmetry 3                        |                                         |
area 1              | fractal dimension 3               |                                         |


Top features in the UCI Breast Cancer dataset computed by Zoish/XGboost, compared with 
Random Forest, Logistic Regression, and LIME. For all features, 1 = Mean, 2 = Standard 
Error, 3 = Worst. The top 20 features in Column 2 are ranked based on permutation p-value for RF. 
In red are features found in LIME but not considered significant in RF/LR. 


5.2. Use case 2: Predict short-term PD progression status using the 
Montreal Cognitive Assessment (MoCA) 


Our model, Zoish, was put to another practical test where we aimed to predict short-term 
PD progression status using the Montreal Cognitive Assessment (MoCA) total scores for 
patients at baseline. MoCA was developed as a tool to screen patients who present with mild 
cognitive complaints and usually perform in the normal range on the MMSE (Mini-Mental 
State Examination).?° For this prediction task, we utilized the AMP-PD dataset, which is a 
comprehensive collection of data from various sources, including clinical information, genetic 
data, imaging data, and other biomarkers from individuals with Parkinson’s disease. The 
dataset consists of eight cohorts, making it a large and harmonized resource. Access to the 
data was obtained under the AMP-PD Data Use Agreement, and the information was retrieved 
from the website: https://amp-pd.org/. Our prediction model incorporated several essential 
features from the datasets, such as "family history," genetic information (PRS), "medical 
history," "smoking and alcohol history," and demographic information of the participants from 
the eight cohorts. After fitting the model, we evaluated its performance using the coefficient 
of determination, commonly known as R-squared, and achieved an R-squared value of 23.6 
percent on the test dataset. Additionally, we calculated the Mean Squared Error (MSE) of 
the model to be 2.49 for the test dataset. Furthermore, the Mean Absolute Error (MAE) was 
found to be 13.04. The MAE represents the average absolute difference between the predicted 
values and the actual total score values in the test dataset. The features selected by Zoish 
are shown in Fig. 3. 
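
For reference, the three evaluation metrics used here can be computed as in the short sketch below. The scores are made-up numbers chosen for easy arithmetic; they are not AMP-PD data and do not reproduce the values reported in this section. The function name is hypothetical.

```python
def regression_metrics(y_true, y_pred):
    """Return MAE, MSE and R-squared for paired true and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    mse = ss_res / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "r2": r2}


# Made-up MoCA-like total scores, for illustration only.
y_true = [26.0, 24.0, 28.0, 22.0]
y_pred = [25.0, 25.0, 27.0, 23.0]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

R-squared is expressed here as a fraction; multiplying by 100 gives the percentage form.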

Fig. 3. List of selected features and their importance by Zoish 

In order to draw a comparison with another prevalent feature selector, specifically 
SelectFromModel from the sklearn library, we applied it to the same dataset under identical 
conditions. This application yielded a Mean Absolute Error (MAE) of 15.91, a Mean Squared 
Error (MSE) of 2.69, and an R-squared value of 0.12. The features selected by this approach 
are depicted in Fig. 4. 


Fig. 4. List of selected features and their importance by SelectFromModel 


[Bar chart: Feature importances] 




As observed, Zoish not only outperforms its counterparts in terms of prediction accuracy, 
but it also excels in the selection of meaningful features. Notably, the Polygenic Risk Scores 
(PRSs) selected by Zoish have demonstrated substantial relevance to the Montreal Cognitive 
Assessment (MoCA). A prime example among these is PGS001641, which is renowned for its 
strong correlation with the volume of white matter, normalised for head size. This particular 
PRS underscores the genetic predisposition towards the volume of white matter, a crucial 
neuroimaging measurement that relates directly to cognitive functions evaluated in the MoCA 
test. Therefore, the selection of this PRS by Zoish validates its capability in discerning features 
with profound implications for cognitive assessment. 


6. Evaluations and Performance Analysis 


To offer a comprehensive evaluation of Zoish, we performed an array of tests ranging from 
comparative analyses to hyperparameter optimization and scalability assessments. 

Comparative Analysis: We initiated our evaluation with a rigorous comparison in- 
volving 300 synthetic datasets tailored to mirror the complexities of healthcare data. These 
datasets span regression, binary classification, and multi-label classification tasks. Zoish was 
compared against six established feature selection techniques from Scikit-learn under identical 
conditions. Our findings suggested that Zoish surpassed other selectors in 77% of regression 
problems, while in multi-label classification and binary classification tasks, Zoish outperformed 
in 53% and 57% of the cases, respectively (refer to Table 2). 


Table 2. Performance comparison of Zoish with other feature selectors 

Selector            Regression   Binary Classification   Multi-label Classification
Zoish               77%          57%                     53%
VarianceThreshold   2%           3%                      6%
SelectKBest         2%           3%                      6%
SelectPercentile    2%           2%                      2%
RFE                 5%           10%                     6%
RFECV               7%           10%                     11%
SelectFromModel     5%           15%                     16%
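
One way to read the percentages in Table 2 is as win rates: the share of datasets on which a selector achieved the best score. The sketch below shows that tally under this assumed reading. The scores are made up, the function name is hypothetical, and ties are resolved in favor of the selector listed first, which is an arbitrary choice for the example.

```python
from collections import Counter


def win_rates(scores_by_dataset):
    """Share of datasets on which each selector achieved the best score."""
    wins = Counter()
    for scores in scores_by_dataset:
        best = max(scores, key=scores.get)  # first entry wins a tie
        wins[best] += 1
    n = len(scores_by_dataset)
    return {name: wins[name] / n for name in scores_by_dataset[0]}


# Made-up scores (higher is better) on four datasets; not results from the paper.
scores_by_dataset = [
    {"Zoish": 0.91, "RFE": 0.88, "SelectFromModel": 0.90},
    {"Zoish": 0.85, "RFE": 0.87, "SelectFromModel": 0.80},
    {"Zoish": 0.78, "RFE": 0.70, "SelectFromModel": 0.75},
    {"Zoish": 0.93, "RFE": 0.90, "SelectFromModel": 0.92},
]

rates = win_rates(scores_by_dataset)
print(rates)
```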


Hyperparameter Optimization: While Zoish itself is powerful, coupling it with a hyper- 
parameter optimization tool like Lohrasb significantly improves performance. We performed 
100 runs comparing Zoish’s efficacy with and without Lohrasb, and found marked improve- 
ments when paired with Lohrasb (see Fig. 5). 

Scalability: Our most recent update introduces a faster algorithm for Shapley value com- 
putation, making Zoish efficient on large datasets. In our trials, Zoish selected 500 features 
from a dataset with 10,000 samples in under 2 minutes on a machine with a 2.3 GHz Quad- 
Core Intel Core i7 processor and 32 GB RAM. 

Fig. 5. The importance of Hyperparameter Optimization for Better Feature Selection 

All the code for our tests is available in the public repository, allowing for independent 
verification and further exploration of Zoish’s capabilities. 


7. Discussion and Limitations 


This paper introduces Zoish, a feature selection tool built on cooperative game theory prin- 
ciples.'© Zoish has gained traction in the community, as evidenced by a significant number of 
downloads from pip-trends (https://piptrends.com/package/zoish). The tool specializes 
in optimizing predictive models, particularly in the healthcare sector, and leverages Shap- 
ley additive values for a comprehensive view of feature importance at both local and global 
scales.'° Through its Nullity property, Zoish effectively minimizes model complexity by omit- 
ting features with negligible Shapley values, thereby retaining model performance.’ The tool 
is further enriched by integration with the Lohrasb package, which aids in achieving optimal 
estimators and hyperparameter settings.’ 

While Zoish’s capabilities are robust, some limitations are noteworthy. Firstly, its computa- 
tional efficiency may be compromised when dealing with exceptionally large datasets. Secondly, 
the Shapley values employed assume feature independence and local linearity—assumptions 
that may not be fully met in complex applications like healthcare. These limitations are par- 
tially mitigated by Zoish’s tree-based modeling approach, which is robust to feature correlation 
and can capture non-linear relationships.° 

The flexibility and interpretability of Zoish make it a promising tool for future appli- 
cations in other high-dimensional data fields, including finance and e-commerce. Additional 
functionalities could be incorporated to broaden its applicability further. 

Future work will focus on extending Zoish’s utility to various high-dimensional domains 
and incorporating more algorithms and tools for an even more robust feature selection process. 
We have conducted extensive tests on synthetic datasets mimicking real-world complexities 
in healthcare, which are detailed in Section 6. These tests demonstrate Zoish’s reliability and 
adaptability, even under challenging conditions. 


The concept of a digital twin came from the engineering, industrial, and manufacturing domains to 
create virtual objects or machines that could inform the design and development of real objects. This 
idea is appealing for precision medicine where digital twins of patients could help inform healthcare 
decisions. We have developed a methodology for generating and using digital twins for clinical 
outcome prediction. We introduce a new approach that combines synthetic data and network science 
to create digital twins (i.e. SynTwin) for precision medicine. First, our approach starts by estimating 
the distance between all subjects based on their available features. Second, the distances are used to 
construct a network with subjects as nodes and edges defining distance less than the percolation 
threshold. Third, communities or cliques of subjects are defined. Fourth, a large population of 
synthetic patients is generated using a synthetic data generation algorithm that models the 
correlation structure of the data to generate new patients. Fifth, digital twins are selected from the 
synthetic patient population that are within a given distance defining a subject community in the 
network. Finally, we compare and contrast community-based prediction of clinical endpoints using 
real subjects, digital twins, or both within and outside of the community. Key to this approach are 
the digital twins defined using patient similarity that represent hypothetical unobserved patients with 
patterns similar to nearby real patients as defined by network distance and community structure. We 
apply our SynTwin approach to predicting mortality in a population-based cancer registry (n=87,674) 
from the Surveillance, Epidemiology, and End Results (SEER) program from the National Cancer 
Institute (USA). Our results demonstrate that nearest network neighbor prediction of mortality in this 
study is significantly improved with digital twins (AUROC=0.864, 95% CI=0.857-0.872) over just 
using real data alone (AUROC=0.791, 95% CI=0.781-0.800). These results suggest a network-based 
digital twin strategy using synthetic patients may add value to precision medicine efforts. 


Keywords: Digital twins; Precision medicine; Artificial intelligence; Synthetic data. 


1. Introduction to Digital Twins 


The concept of a digital twin came from the engineering, industrial, and manufacturing domains 
and refers to the creation of virtual objects or machines that can inform the design and development 
of real objects (Grieves & Vickers 2017). The promise of this approach in manufacturing is to reduce 
costs, improve efficiency, reduce waste, and minimize variability among products (Attaran et al. 
2023). This is accomplished by enumerating and evaluating design parameters of the digital twin of 
a physical product with some measurable outcome that can then be applied to manufacturing. Use 
cases in industry include product design, process design and optimization, supply chain 
management, preventive system maintenance, farm management, weather modeling, soil 
management, facility and operations design, construction, etc. (Attaran & Celik 2023). Consider the 
use case of monitoring weed pressure and crop growth (Verdouw et al. 2021). Data on crops, weeds, 
weather, and soil conditions are collected from crop sensors. These data are used to build a digital 
twin of the crops where parameters for a weeding machine can be enumerated and evaluated. 
Optimized parameters from the digital twin can then be put into practice for weed management with 
benefits including crop weight, size, and yield. 

The successful use of digital twins in industry has opened the door for their use in medicine and 
healthcare where they represent virtual or simulated patients that could be used to inform health 
outcomes or treatment decisions for real patients (Acosta et al. 2022). This idea of using digital twins 
in precision medicine has been explored for asthma management (Drummond et al. 2023), the 
treatment of immune-mediated diseases (Benson 2023), and dementia care (Wickramasinghe et al. 
2022), for example. Despite the interest in this area, the development of computational methods and 
open-source software for creating and using digital twins has been slow to emerge. This is likely 
because industrial twins of mechanical objects rely on principles of physics and engineering, and 
comparable principles are not known in enough detail to create simulated patients with molecular, 
cellular, physiological, and anatomical realism and appropriate environmental and societal context. 
Some of these challenges have been previously discussed (Benson 2023). 

The goal of the present study was to create a computational methodology for generating digital 
twins based on synthetic patients rather than biophysics. The generation of synthetic data is 
becoming a mature field (Gonzales et al. 2023) and lends itself well to the digital twin strategy. The 
working hypothesis is that the correlation structure of clinical variables among patients can inform 
the creation of digital twins that represent unobserved individuals. In other words, patient 
relationships might be able to serve as a surrogate for biophysical realizations. The advantage of 
this surrogate approach is that it can be implemented and evaluated today while we wait for better 
and more complete biophysical models that could take decades to develop and validate. 

We introduce here a new approach that combines synthetic data and network science to create 
digital twins (i.e. SynTwin) for precision medicine. First, we estimate the distance 
between all subjects based on their available features, exploring several different distance 
metrics. Second, the distances are used to construct a network with subjects as nodes and edges 
connecting pairs whose distance is less than the percolation threshold. Third, communities or 
cliques of subjects are identified using the Multilevel community detection algorithm. Fourth, a 
large population of synthetic patients or subjects is generated; several synthetic data generators 
were evaluated. Fifth, digital twins are selected from the synthetic patient population that are 
within a given distance defining a subject community in the network. By design, the digital twins 
represent unobserved hypothetical patients with clinical profiles similar to those of their real 
patient counterparts. Finally, we compare community-based prediction of clinical endpoints using 
real subjects, digital twins, or both. 
This is compared to predictive performance using real patients outside the community as a baseline. 
We apply our synthetic digital twin (SynTwin) approach to predicting mortality in a population- 
based cancer registry (n=87,674) from the Surveillance, Epidemiology, and End Results (SEER) 
program from the National Cancer Institute (USA). Bootstrapping is used to assess the standard 
error of all performance metrics and to estimate 95% confidence intervals for hypothesis testing. 
Our results demonstrate that nearest network neighbor prediction of mortality in the SEER breast 
cancer data is significantly improved with digital twins. These results support a growing number of 
studies highlighting the benefit of synthetic data in other applications. 


2. Methods 


We describe here the data used and the detailed methods for the SynTwin approach. 


2.1. Cancer Registry Data 


We chose a population-based cancer registry from the Surveillance, Epidemiology, and End 
Results (SEER) program from the National Cancer Institute (USA) for this study due to its large 
sample size and ease of access by simple registration with an email address to allow for 
reproducibility. We utilized SEER*Stat Version 8.4.1 for data retrieval. 

To extract patient data specifically for breast cancer, we applied the following filters: 


Database name: Incidence - SEER Research Data, 17 Registries, Nov 2021 Sub (2000-2019) - 
Linked To County Attributes - Time Dependent (1990-2019) Income/Rurality, 1969-2020 Counties. 


Additional filter criteria included: 


Site recode ICD-O-3/WHO 2008 = ‘Breast’ AND Year of diagnosis = ‘2010’, ‘2011’, ‘2012’, 
‘2013’, ‘2014’, ‘2015’ AND {Vital status recode (study cutoff used) = ‘Alive’ OR {Vital status 
recode (study cutoff used) = ‘Dead’ AND SEER cause-specific death classification = ‘Dead 
(attributable to this cancer dx)’}} 


We chose to exclude data from the years 2015-2019 due to the significant imbalance observed 
within that period. Specifically, the data exhibited a notable disparity between the number of 
surviving patients and the number of deceased cases. More than 80% of the patients within that 
timeframe were still alive, rendering the dataset heavily skewed. Our criteria yielded 324,117 
patient records. Removing redundant entries resulted in 231,930 records, consisting of 188,093 
Alive cases and 43,837 Dead cases. Subsequently, we conducted a stratified sampling based on vital 
status to create a balanced dataset for prediction purposes. We retained all Dead cases (n=43,837) 
and randomly undersampled the same number of Alive cases (n=43,837). This process yielded a 
total of 87,674 records for our final dataset. We partitioned this sample into a training dataset 
(n=57,674) and a validation dataset of approximately 1/3 of the sample (n=30,000) to assess internal 
validity of the results. The training data was used to generate the digital twins while the validation 
dataset was held out for making predictions using the real patient data and their network and 
communities. The data processing steps are outlined by the flowchart in Figure 1. 
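
The balancing and partitioning described above can be illustrated with a minimal sketch. It assumes each record is a dictionary with a hypothetical `vital_status` key and that the holdout set is drawn by a simple random split; the function name, the key, and the toy records are illustrative assumptions and are not taken from the SEER export or from the authors' code.

```python
import random

def balance_and_split(records, label_key="vital_status", n_validation=3, seed=0):
    """Keep every 'Dead' record, randomly undersample 'Alive' records to the
    same count, then hold out n_validation records for validation."""
    rng = random.Random(seed)
    dead = [r for r in records if r[label_key] == "Dead"]
    alive = [r for r in records if r[label_key] == "Alive"]
    alive_sample = rng.sample(alive, len(dead))  # undersample the majority class
    balanced = dead + alive_sample
    rng.shuffle(balanced)
    validation = balanced[:n_validation]
    training = balanced[n_validation:]
    return training, validation

# Toy records (hypothetical, not SEER data): 4 Dead and 10 Alive.
toy = [{"id": i, "vital_status": "Dead"} for i in range(4)] + \
      [{"id": i, "vital_status": "Alive"} for i in range(4, 14)]
training, validation = balance_and_split(toy, n_validation=3)
```

In the study the same idea is applied at scale: 43,837 records per class, with 30,000 of the 87,674 balanced records held out for validation.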

Features included age, year of diagnosis, sex, race, ICDO3, tumor grade, laterality, primary site, 
survival in months, tumor sequence, diagnostic confirmation, ICCC site, combined summary stage, 
and vital status (Alive or Dead). The last feature was used as the clinical outcome or class variable 
for prediction. 

2.2. Bootstrapping 


A central goal of this study was to compare and contrast different methods for estimating patient 
distances, different methods for generating synthetic data, and different approaches to using digital 
twins to predict outcome. To generate a sampling distribution for every performance measure, we 
carried out 1000-fold bootstrapping: in each community of size 10 or greater, 90% of the patients 
in the holdout or validation data were sampled with replacement, and this was repeated 1000 times. 
Each performance measure was estimated using all 1000 replications to derive its empirical 
distribution. This allowed 
95% confidence intervals to be estimated for all performance metrics. These were used for 
uncertainty quantification, statistical comparisons, and hypothesis testing. 
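
A minimal sketch of this resampling scheme is given below. It assumes the 95% confidence interval is formed as the bootstrap mean plus or minus 1.96 bootstrap standard errors, which is one common choice and consistent with the symmetric intervals reported in the tables; a percentile interval would be an equally plausible alternative. The function name and toy data are hypothetical.

```python
import random
import statistics

def bootstrap_ci(values, metric, n_boot=1000, frac=0.9, seed=0):
    """Resample frac * len(values) items with replacement n_boot times and
    return the bootstrap mean of `metric` with a normal-approximation 95% CI."""
    rng = random.Random(seed)
    k = max(1, int(round(frac * len(values))))
    reps = [metric(rng.choices(values, k=k)) for _ in range(n_boot)]
    mean = statistics.fmean(reps)
    se = statistics.stdev(reps)  # bootstrap standard error (assumed CI basis)
    return mean, (mean - 1.96 * se, mean + 1.96 * se)

# Toy example: accuracy over hypothetical correct (1) / incorrect (0) predictions.
correct = [1] * 80 + [0] * 20
mean_acc, (ci_low, ci_high) = bootstrap_ci(correct, statistics.fmean)
```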


Fig. 1. Flowchart for data processing and analysis. 


2.3. The SynTwin Algorithm for Network-Based Generation of Digital Twins from 
Synthetic Data 


We describe here our six-step algorithm for generating digital twins and using them for 
predicting mortality in the SEER data. This involves computing patient distances based on clinical 
features, constructing a network using distances based on the percolation threshold, identifying 
patient communities, generating synthetic patients, selecting digital twins, and nearest neighbor 
(i.e. within community) prediction of mortality. 


2.3.1. Distance Measures 


The first step is to estimate the distances between patients. We evaluated four different distance 
metrics. These include Euclidean, Manhattan, Cosine (Lee et al. 2015), and Gower (Gower 1971). 
Each has different strengths and weaknesses. For example, Gower is appealing because it is 
scale-invariant and works well with both discrete and continuous data. Further, as shown in the 
results, this distance measure yielded the best results. 
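
To make the Gower measure concrete, the following sketch computes it for two patients with mixed features. It uses the standard definition (range-normalized absolute difference for numeric features, simple mismatch for categorical features, averaged over features with equal weights); the feature names, the value range, and the function name are illustrative assumptions rather than the study's actual encoding.

```python
def gower_distance(a, b, numeric_ranges):
    """Gower distance between two records given as dicts of feature -> value.
    numeric_ranges maps each numeric feature to (max - min) over the cohort;
    every other feature is treated as categorical."""
    total = 0.0
    for feature in a:
        if feature in numeric_ranges:
            span = numeric_ranges[feature]
            total += abs(a[feature] - b[feature]) / span if span > 0 else 0.0
        else:
            total += 0.0 if a[feature] == b[feature] else 1.0
    return total / len(a)

p1 = {"age": 50, "grade": "II", "laterality": "Left"}
p2 = {"age": 70, "grade": "II", "laterality": "Right"}
ranges = {"age": 40.0}  # assumed observed age range in the cohort
d = gower_distance(p1, p2, ranges)  # (20/40 + 0 + 1) / 3 = 0.5
```

Because every per-feature term lies between 0 and 1, no feature dominates the distance simply because of its units, which is the scale invariance noted above.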


2.3.2. Network Construction 


The second step is to build a network with patients as nodes and edges with weights based on 
the estimated distances in the first step. To prevent an uninformative fully connected network, we 
used a percolation threshold equal to the first upward inflection point of the convex part of the 
sigmoid relationship between edge weight (X axis) and network size (Y axis) as an objective 
approach to filtering edges. 
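
The curve from which the percolation threshold is read can be sketched as follows. The sketch assumes that "network size" means the number of nodes in the largest connected component and that an edge is kept when its distance is below the candidate threshold; both are assumptions for illustration, and locating the inflection point on the resulting curve is left to inspection. All names and the toy distances are hypothetical.

```python
def largest_component_size(n_nodes, edges):
    """Size of the largest connected component, computed with union-find."""
    parent = list(range(n_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
    counts = {}
    for node in range(n_nodes):
        root = find(node)
        counts[root] = counts.get(root, 0) + 1
    return max(counts.values())

def size_curve(n_nodes, distances, thresholds):
    """distances: dict {(u, v): distance}. Returns [(threshold, size), ...]."""
    curve = []
    for t in thresholds:
        kept = [pair for pair, dist in distances.items() if dist < t]
        curve.append((t, largest_component_size(n_nodes, kept)))
    return curve

# Toy pairwise distances among five patients (hypothetical values).
distances = {(0, 1): 0.1, (1, 2): 0.2, (2, 3): 0.6, (0, 3): 0.7, (3, 4): 0.9}
curve = size_curve(5, distances, [0.05, 0.15, 0.25, 0.65, 0.95])
```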


2.3.3. Community Detection 


The third step is to detect communities of patients (i.e. cliques or modules) in the network. There 
are many different community detection algorithms for large networks. We selected the Multilevel 
algorithm (Blondel et al. 2008) for this study. This algorithm uses a heuristic for modularity 
optimization and is designed specifically for large networks. The Multilevel algorithm was shown 
to outperform other community detection algorithms available at the time and with better time 
complexity (Blondel et al. 2008). Further, a more recent study compared this algorithm with seven 
others on several graph benchmarks and showed that the Multilevel algorithm was best for both 
accuracy and time complexity (Yang et al. 2016). In our study, we varied the resolution parameter 
settings to maximize the number of communities with at least 10 subjects. This yielded between 
11,000 and 19,000 communities across the four different distance metrics we investigated. 


2.3.4. Synthetic Data Generation 


The fourth step is to generate synthetic patients to be used as the population to select digital 
twins from. We evaluated three synthetic data generation algorithms. The first, categorical latent 
Gaussian process (CLGP), uses continuous latent variables to represent categorical variables that 
can then be modeled using a Gaussian process (Gal et al. 2015). Here, synthetic data can be 
generated by sampling from the posterior distribution of the latent variables. The second, mixture 
of product of multinomials (MPoM), uses a probabilistic model to generate synthetic data with 
similar statistical properties to the original data (Dunson & Xing 2009). The third, multi-categorical 
extension of a medical generative adversarial network (MC-MedGAN), uses two adversarial neural 
networks to generate synthetic data (Choi et al. 2017). Here, the first network learns to generate 
realistic synthetic data, and the second one attempts to distinguish between real and synthetic data 
generated by the first network. Autoencoders are used to transform the multivariate categorical data 
to continuous values, which are then used by the GAN to generate synthetic data. 

All three of these methods were recently evaluated and compared (Goncalves et al. 2020). We 
used the following performance metrics highlighted in this study to evaluate each approach: 
pairwise correlation difference (PCD), log-cluster (LC), support coverage (SC), and cross- 
classification (CrCl). The PCD metric is computed as the Frobenius norm difference between 
Pearson correlation matrices of real and synthetic datasets. It measures how well a method captures 
the correlation between variables. The LC metric assesses the similarity in latent structure between 
real and synthetic datasets using k-means clustering. The SC metric quantifies the extent to which 
the variable's support in real data is captured in synthetic data. It is calculated as the ratio of the 
cardinalities of the support (number of levels) of each variable in real and synthetic data. CrCl assesses 
how accurately a synthetic dataset replicates the statistical dependence found in real data using a 
classifier. 
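
Two of these metrics are simple enough to sketch directly. The code below computes PCD as the Frobenius norm of the difference between Pearson correlation matrices, and SC as the average, over variables, of the fraction of real levels that also appear in the synthetic data. The averaging and the direction of the SC ratio are assumptions made for illustration, columns are assumed to be numerically encoded for PCD, and the function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def corr_matrix(columns):
    return [[pearson(ci, cj) for cj in columns] for ci in columns]

def pcd(real_cols, synth_cols):
    """Frobenius norm of the difference between the two correlation matrices."""
    r, s = corr_matrix(real_cols), corr_matrix(synth_cols)
    size = len(r)
    return math.sqrt(sum((r[i][j] - s[i][j]) ** 2
                         for i in range(size) for j in range(size)))

def support_coverage(real_cols, synth_cols):
    """Mean fraction of each real variable's levels found in the synthetic data."""
    ratios = [len(set(s) & set(r)) / len(set(r))
              for r, s in zip(real_cols, synth_cols)]
    return sum(ratios) / len(ratios)

real = [[0, 1, 2, 3], [0, 1, 2, 3]]    # two perfectly correlated variables
synth = [[0, 1, 2, 3], [3, 2, 1, 0]]   # correlation flipped to -1
pcd_value = pcd(real, synth)           # off-diagonals differ by 2: sqrt(8)
sc_value = support_coverage([["A", "B", "C", "D"]], [["A", "B", "A", "B"]])
```

A PCD of zero means the synthetic data reproduce the real correlation structure exactly, and an SC of one means every level observed in the real data also occurs in the synthetic data.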

We used the best hyperparameters reported for “small-set” in the study (Goncalves et al. 2020) 
to set up our synthetic data generation algorithms considering the smaller number of variables in 
our dataset. For CLGP we used 100 inducing points and a 5-dimensional latent space. For MPoM we 
set the number of clusters (k) to 30, the concentration parameter (α) to 10, Gibbs sampling steps to 
10,000, and burn-in steps to 1,000. For MC-MedGAN we used a learning rate of 1e-3 and a batch size 
of 100 samples. We applied L2 regularization on the weights of the neural network with λ=1e-3 and 
set the temperature parameter for the Gumbel-Softmax trick to τ=0.666. The autoencoder part was built 
with a code size of 64, two encoder layers (hidden sizes 256 and 128), and two decoder layers (hidden 
sizes 256 and 128). The GAN part consisted of one generator step with two generator layers 
(hidden sizes 64 and 64) and two discriminator steps each with two discriminator layers (hidden 
sizes 256 and 128). The autoencoder and the GAN were trained for 100 and 500 epochs, 
respectively. 


2.3.5. Selection of Digital Twins 


The fifth step is to select digital twins from a population of synthetic patients. For a synthetic 
patient to be a digital twin it must be within some distance of one or more real patients such that the 
clinical features can represent realistic unobserved measures and outcomes. For each community 
we selected those synthetic patients whose distances place them within that community. We refer 
to these virtual patients as digital twins of the real patients in the community. Only those digital 
twins in a community are used for prediction of mortality. 
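
One way to read this selection rule is sketched below: a synthetic patient becomes a digital twin of a community if its distance to at least one real member is below a threshold. The membership criterion, the use of Manhattan distance on numeric tuples, and all names and values are assumptions for illustration; the exact rule used in the study is not specified beyond the description above.

```python
def select_digital_twins(community, synthetic, distance, threshold):
    """Return the synthetic patients lying within `threshold` of at least
    one real patient in the community."""
    twins = []
    for candidate in synthetic:
        if any(distance(candidate, member) < threshold for member in community):
            twins.append(candidate)
    return twins

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

# Toy feature vectors (hypothetical values).
community = [(1.0, 1.0), (1.2, 0.9)]
synthetic = [(1.1, 1.0), (5.0, 5.0), (0.9, 1.3)]
twins = select_digital_twins(community, synthetic, manhattan, threshold=0.5)
```

Here the second synthetic patient is far from every community member and is therefore discarded, while the other two are retained as twins.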


2.3.6. Prediction of Mortality 


The final step is to use features from real patients and/or digital twins to predict mortality (Alive 
or Dead) by majority vote, following the study design described in the next section (2.4). This 
prediction strategy resembles k-nearest neighbor classification. We estimated six different 
classification performance measures for predicting mortality across 1000 bootstrapped samples of 
the holdout data sampled with replacement from each community with at least 10 patients. These 
included accuracy, balanced accuracy, area under the receiver operating characteristic curve 
(AUROC), precision, recall, and F1. The mean of each performance metric across the 1000 
bootstrapped datasets was reported along with the bootstrapped 95% confidence interval (CI). 
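
The majority vote and the threshold-based performance measures can be sketched as follows. The sketch treats "Dead" as the positive class and breaks ties in favor of the label seen first; both are assumptions. AUROC is omitted because it requires a continuous score (for example, the fraction of neighbors voting "Dead") rather than a hard label.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most common label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]

def classification_metrics(y_true, y_pred, positive="Dead"):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    def ratio(num, den):
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
    }

prediction = majority_vote(["Dead", "Alive", "Dead"])
y_true = ["Dead", "Dead", "Alive", "Alive", "Dead", "Alive"]
y_pred = ["Dead", "Alive", "Alive", "Alive", "Dead", "Dead"]
metrics = classification_metrics(y_true, y_pred)
```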


2.4. Study Design and Analysis 


A central goal of this study is to evaluate whether digital twins add any value to predicting 
mortality beyond that provided by data from the real patients. To answer this question, we 
developed the following study design (Figure 2). Here, we evaluated prediction of mortality in 
target patients (black circle) using real patients (A), digital twins (B), real patients and digital twins 
(C), the closest digital twins equal to the number of real patients in the community (D), real patients 
and closest digital twins (E), and real patients outside the community (F) as a control for the value 
of considering communities. Here, each subject in a community alternates as the target patient in a 
leave-one-out style analysis. 

A total of one million synthetic patients were generated from the training data using the best 
synthetic data generation algorithm (MPoM). We predicted target patient mortality using the nearest 
neighbor majority vote classification method in the holdout or validation dataset. We estimated 95% 
confidence intervals for each of the classification performance metrics and statistically compared 
distance metrics, synthetic data generation algorithms, and study designs. 
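
The leave-one-out comparison can be illustrated for two of the designs. In this sketch, design A votes over the other real patients in the community, and design D votes over the digital twins closest to the target, with the number of twins set equal to the number of other real patients. That count, the one-dimensional toy features, and all names are assumptions for illustration; the toy data are constructed to show the mechanics and are not evidence for either design.

```python
from collections import Counter

def vote(labels):
    return Counter(labels).most_common(1)[0][0]

def leave_one_out(community, twins, distance):
    """community, twins: lists of (features, label). Each real patient takes a
    turn as the target. Returns (true label, design A prediction, design D
    prediction) for every target."""
    results = []
    for i, (target_x, target_y) in enumerate(community):
        others = community[:i] + community[i + 1:]
        pred_a = vote([label for _, label in others])          # design A
        ranked = sorted(twins, key=lambda twin: distance(twin[0], target_x))
        closest = ranked[:len(others)]                         # design D
        pred_d = vote([label for _, label in closest])
        results.append((target_y, pred_a, pred_d))
    return results

def abs_diff(a, b):
    return abs(a[0] - b[0])

community = [((0.0,), "Alive"), ((0.1,), "Alive"), ((0.2,), "Alive"), ((1.0,), "Dead")]
twins = [((0.05,), "Alive"), ((0.95,), "Dead"), ((1.05,), "Dead"),
         ((0.02,), "Alive"), ((0.9,), "Dead")]
results = leave_one_out(community, twins, abs_diff)
accuracy_a = sum(truth == a for truth, a, _ in results) / len(results)
accuracy_d = sum(truth == d for truth, _, d in results) / len(results)
```

In this toy community, design A misclassifies the single "Dead" patient because all of its real neighbors are "Alive", whereas design D recovers it from the twins nearest to that patient.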


[Figure 2 legend: Real Patients, Target Patient, Digital Twins] 
Fig. 2. Study design for comparing outcome prediction using real patients and/or digital twins. The large 
circles represent a community within the patient network. Prediction of the target patient is carried out 
using real patients (A), digital twins (B), real patients and digital twins (C), the closest digital twins (D), 
real patients and closest digital twins (E), and real patients outside the community (F). 


3. Results 


Table 1 summarizes the performance metrics for the three synthetic data generators considered. 
Across all metrics, mixture of product of multinomials (MPoM) performed significantly better than 
the other two methods with nonoverlapping 95% confidence intervals. Consider for example that 
MPoM had a cross-classification (CrCl) of 0.982 indicating a very high degree of correlation 
between the same features in the real dataset and in the synthetic dataset. This was significantly 
higher than the CrCl for MC-MedGAN (0.759) and CLGP (0.645) with nonoverlapping confidence 
intervals when compared to MPoM. This was true for the other metrics. The only exception was the 
categorical latent Gaussian process (CLGP) for coverage which was comparable to MPoM. These 
results mirror a previous evaluation of these algorithms using the SEER data where MPoM 
outperformed the MC-MedGAN adversarial neural network approach (Goncalves et al. 2020). 
Therefore, we selected MPoM as our synthetic data generator and used it for the remainder of the 
study. 

Table 1. Comparison of synthetic data algorithms (columns) for four performance metrics (rows). Bolded metric 
values are significantly better than the others. 


         CLGP                      MC-MedGAN                 MPoM 
Metric   Mean     95% CI           Mean     95% CI           Mean     95% CI 
CrCl      0.645    0.533,  0.757    0.759    0.634,  0.884    0.982    0.941,  1.022 
LC       -1.545   -1.594, -1.496   -2.941   -4.211, -1.672   -5.191   -6.146, -4.236 
SC        1.000    0.999,  1.001    0.830    0.625,  1.036    0.989    0.978,  1.000 
PCD       1.723    1.453,  1.992    2.772    1.007,  4.538    1.012    0.720,  1.305 


Table 2. Comparison of study design performance as measured by AUROC for each distance measure. Bolded metric 
values are significantly better than the others. 


         Cosine                 Euclidean              Gower                  Manhattan 
Design*  mean   95% CI          mean   95% CI          mean   95% CI          mean   95% CI 
A        0.800  0.792, 0.808    0.807  0.800, 0.814    0.791  0.781, 0.800    0.800  0.792, 0.807 
B        0.793  0.785, 0.801    0.799  0.792, 0.806    0.784  0.774, 0.794    0.792  0.784, 0.800 
C        0.793  0.785, 0.801    0.798  0.791, 0.805    0.783  0.773, 0.793    0.791  0.783, 0.798 
D        0.840  0.833, 0.847    0.848  0.842, 0.854    0.864  0.857, 0.872    0.852  0.845, 0.858 
E        0.840  0.833, 0.847    0.845  0.839, 0.852    0.852  0.844, 0.860    0.846  0.839, 0.852 
F        0.510  0.500, 0.521    0.512  0.503, 0.522    0.494  0.482, 0.507    0.485  0.475, 0.495 

*Real patients (A), digital twins (B), real patients and digital twins (C), closest digital twins (D), real patients and closest 
digital twins (E), and real patients outside the community (F). 



Table 2 summarizes the AUROC for predicting mortality in the holdout or validation data for 
each of the four distance metrics and each of the six study designs (A-F, see Figure 2). Study designs 
D and E had significantly higher AUROCs than the others but were not significantly better than 
each other given overlapping confidence intervals. Unique to study designs D and E is the presence 
of digital twins selected to be close to the target patient being predicted. The performance of D and 
E was significantly higher for the Gower distance than Cosine, Euclidean, or Manhattan. Therefore, 
we are reporting the mean AUROCs for Gower distance. These patterns of significance were similar 
for accuracy, balanced accuracy, and the other performance metrics (tables not shown). For 
example, the Gower accuracies for D and E were 0.788 (95% CI=0.780-0.797) and 0.781 (95% 
CI=0.772-0.790), respectively. The Gower accuracy for just the real patients (A) in the community 
was 0.719 (95% CI=0.710-0.728). The mean balanced accuracies were very similar for D (0.789), E 
(0.783), and A (0.721) suggesting that there were no biased accuracies due to imbalanced data. Thus, 
the accuracy associated with including close digital twins within communities was significantly 
higher than that for just real patients within communities. 

Interestingly, the performance of A (real patients only), B (digital twins only), and C (real 
patients and digital twins) were not significantly different from one another across the different 
distance metrics including Gower. It is important to note that F (real patients outside the community) 
had an AUROC of approximately 0.50 as might be expected by chance given these patients have a 
distance that exceeds the percolation threshold and places them outside the community. Thus, the 
distance from the target patient being considered for prediction plays an important role in predictive 
accuracy and is highly relevant for precision medicine where context is a key consideration. An 
example network of real patients for three communities is shown in Figure 3 along with the 
corresponding digital patients. 

4. Discussion 


We have developed a new digital twin approach to improve the prediction of clinical endpoints. 
Our approach combines network science to model patient similarity and synthetic data generation 
to generate digital twins (SynTwin). Key to SynTwin is using patient similarity to synthesize nearby 
digital twins that represent hypothetical unobserved patients with clinical data correlations that are 
consistent with real patients. This distance-based approach is different than the digital twin 
approaches from industry that rely on well-known physical principles that govern a complex system 
(Attaran & Celik 2023; Attaran et al. 2023; Grieves & Vickers 2017). Biophysical properties 
governing health are not well known and are often only available for certain cellular or physiological 
processes. Indeed, simulating a single cell is quite challenging for a number of reasons including 
the lack of physics-based models (Thornburg et al. 2022). It is our working hypothesis that distance- 
based digital twins will be useful for informing patient outcomes above and beyond that provided 
by the observed clinical data. Indeed, our results suggest that generating and selecting digital twins 
close to the target patient whose outcome is being predicted significantly improves predictive 
performance above and beyond the real patients in the community. Choosing real patients outside 
the community for predicting target patients inside a community was not better than flipping a coin. 


[Figure 3 legend: real patients inside community 259; real patients inside community 295; real patients inside 
community 296; real patients outside the three communities; digital twins inside the three communities] 


Fig. 3. Section of the network showing three communities of real patients (orange, green, and blue circles). Also 
shown are the digital patients (small purple circles) and real patients outside the communities (grey circles). 


The generation and use of synthetic data for biomedical research is in and of itself not new. A 
recent review highlighted more than 70 published papers representing at least seven different use 
cases for synthetic data (Gonzales et al. 2023). Most of the use cases involve generating a synthetic 
dataset that can be used to avoid the privacy and security concerns of real data. For example, a 
synthetic dataset could be distributed to students to use for learning objectives without fear of 
identifying real patients. Other use cases involve using synthetic data to benchmark algorithms, 
evaluate information technology software, and public release of data. A very specific use case is to 
allow investigators to test a hypothesis without the need for Institutional Review Board (IRB) 
approval and the time it takes to retrieve data from an electronic health record which is a process 
that can take months depending on the complexity of the data and the wait time for available 
qualified personnel. Any interesting patterns found in the easily available synthetic data could then 
justify the time and expense of retrieving real data to confirm the finding before publication as has 
been suggested (Foraker et al. 2018). This approach was recently evaluated by comparing statistical 
and machine learning results obtained from real patient data and a synthetic derivative generated 
using a commercially available platform (Foraker et al. 2020). Similar results were seen when using 
a large integrated data resource (Foraker et al. 2021). In each case, the authors were able to draw 
the same conclusions from the analytical results using both real and synthetic datasets. 

The use of synthetic data to generate digital twins was not mentioned in the review by Gonzales 
et al. (2023). However, using synthetic data to improve the sample size of a real dataset for 
improving predictive accuracy was specifically discussed. A study evaluating the addition of 
synthetic data to a real dataset showed that variance improved and five machine learning algorithms 
had improved prediction of heart disease (Aljaaf et al. 2016). The idea that synthetic data can 
improve machine learning performance has been observed in the image analysis domain. For 
example, a synthetic image generation approach using generative adversarial networks (GANs) has 
been shown to improve image segmentation when the number of training examples is small 
(Thambawita et al. 2022). This approach may have clinical applications. For example, a recent study 
showed that synthetic colonoscopy images with polyps can improve the sensitivity of a deep 
learning neural network to detect polyps in real images (Adjei et al. 2022). This may be true in 
ophthalmology as well (You et al. 2022). Our observation that synthetic data may improve 
predictive accuracy is consistent with these studies. More studies are needed to 
validate this phenomenon. 

Most synthetic data generation studies have focused on generating and using an entire synthetic 
dataset and checking to make sure the patterns detected by a machine learning algorithm are similar 
(Gonzales et al. 2023). Our SynTwin digital twin approach is different in the sense that we are using 
patient similarity and network community structure to select synthetic patients (i.e. twins) that can 
inform clinical outcome prediction. This is a more targeted approach that is much more consistent 
with the goals of precision medicine where treatment decisions and clinical outcomes are assessed 
in patient subgroups with similar characteristics. As such, this represents a fundamental shift in how 
synthetic data are used and may be more informative for clinical decision support. 

Despite progress in this area, there are some possible limitations and challenges for moving 
forward. First, we applied our method to a dataset with a large sample size and a small number of 
features. On one hand, this was an ideal dataset to evaluate a new approach. Further, this dataset is 
publicly available, has been carefully curated, and has been well studied for understanding cancer 
risk and outcomes. However, the question remains of how the SynTwin approach will scale to 
hundreds or thousands of features or how it will behave when the synthetic data are generated from 
a dataset with small sample size. Further, the validation data was derived from the same cohort. 
Second, SynTwin is highly dependent on the community structure of the network. Not every patient 
is part of a community and prediction of outcomes in those patients may need to be performed using 
standard machine learning (ML) methods. Thus, a hybrid SynTwin-ML approach may need to be 
developed to make sure those patients are considered fully. Third, it is of great interest to develop 
formal statistical inferential procedures to quantify the uncertainty of the subsequent analyses 
including estimation and prediction. Intuitively, the generated digital twins should be weighted 
differently from the real patients when assessing the precision of the downstream analyses. Finally, our 
implementation of SynTwin relied on bootstrapping to assign confidence intervals to performance 
metrics. This adds 1000-fold more computation time which might be prohibitive for larger datasets 
with more features. Future studies will need to balance the need for statistical inference with 
computing resources that are available. This study benefited from access to a 2000-core high- 
performance computing system to carry out all computations. 

Precision medicine relies heavily on artificial intelligence and machine learning methods to 
develop models for predicting disease risk and patient outcomes in a manner that takes into account 
the uniqueness of the patient in question and other patients with similar profiles (Rajpurkar et al. 
2022). The SynTwin digital twin strategy we presented here takes a step toward the use of synthetic 
data to augment the prediction of clinical outcomes by generating hypothetical unobserved patients 
to be used alongside real patients. The use of digital twins in medicine and biomedical research is 
in its infancy. We have a lot to learn from industrial uses of this approach and will need to develop 
new algorithms and software that consider the unique aspects of patients and their data. We agree 
with others who have speculated that digital twins will have a big impact on research and patient 
care but that new biophysical, computational, and statistical methods are needed (Acosta et al. 2022; 
Armeni et al. 2022; Attaran & Celik 2023; Attaran et al. 2023; Kamel Boulos & Zhang 2021). 

Classical machine learning and deep learning models for Computer-Aided Diagnosis (CAD) 
commonly focus on overall classification performance, treating misclassification errors (false 
negatives and false positives) equally during training. This uniform treatment overlooks the 
distinct costs associated with each type of error, leading to suboptimal decision-making, 
particularly in the medical domain where it is important to improve the prediction sensitivity 
without significantly compromising overall accuracy. This study introduces a novel deep 
learning-based CAD system that incorporates a cost-sensitive parameter into the activation 
function. By applying our methodologies to two medical imaging datasets, our proposed 
study shows statistically significant increases of 3.84% and 5.4% in sensitivity while maintaining 
overall accuracy for Lung Image Database Consortium (LIDC) and Breast Cancer 
Histological Database (BreakHis), respectively. Our findings underscore the significance of 
integrating cost-sensitive parameters into future CAD systems to optimize performance and 
ultimately reduce costs and improve patient outcomes. 


Keywords: Misclassification errors; Cost-sensitive activation function; Convolutional neural 
network 


1. Introduction 


Machine learning (ML) models have been developed to identify patterns in data across various 
domains, including computer-aided diagnosis,¹ public health,² and defect detection.³ Generally, 
these ML models are optimized based on overall prediction accuracy across all classes 
and data points, assuming that misclassification errors are equal.⁴ However, this assumption 
can be perilous in classification problems where misclassifying a positive instance carries a 
higher cost than misclassifying a negative instance. Particularly in the medical domain, a 
false negative error will likely have much greater consequences than a false positive. 

To address this challenge, we propose a novel cost-aware deep learning-based CAD system 
that incorporates different cost values into the activation function to boost the model’s 
sensitivity. By fine-tuning the cost values associated with false positive and false negative 
instances, we can significantly increase true positives. Our contributions are twofold: 1) a CAD 
training framework designed to enhance sensitivity while maintaining overall accuracy, and 2) 
a proof-of-concept demonstrating the value of incorporating cost values as hyperparameters 
in future CAD systems. 


© 2023 The Authors. Open Access chapter published by World Scientific Publishing Company and 
distributed under the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC) 
4.0 License. 

2. Related Work 
2.1. Cost-Sensitive Learning in Non-Medical Domains 


In recent years, cost-sensitive learning has gained popularity as a valuable tool in non-medical 
domains to tackle class imbalance⁴⁻¹⁰ and to address its associated costs of misclassification. 

Prior research by Zhu and Wan¹¹ proposed a cost-sensitive learning method for semi-supervised 
hit-and-run analysis to handle the data imbalance issue, which significantly improved the 
model’s performance even with a small proportion of labeled historical data. Khoshgoftaar 
et al.¹² introduced cost-sensitive learning into Software Defect Prediction (SDP) and 
used a boosting method to build software quality models. Le et al.¹³ implemented a hybrid 
approach by combining oversampling techniques and cost-sensitive learning, which significantly 
improved bankruptcy prediction performance. Devi et al.¹⁴ proposed a cost-sensitive weighted 
random forest algorithm for effective credit card fraud detection. The model assigns more 
weight to minority instances during training, resulting in improved performance compared to 
existing random forest techniques. Xiao et al.¹⁵ integrated a group handling neural network-based 
cost-sensitive semi-supervised selective ensemble model for credit-scoring problems. 

Other prior work focused on improving the overall prediction performance by modifying the 
loss function to consider different cost values for various misclassifications. Li et al.1° proposed 
a pixel-based adaptive weighted cross-entropy loss function to facilitate road crack detection. 
Wang et al.!’ also introduced a novel cost-sensitive loss function for semantic segmentation 
of remote sensing images. More recently, Li et al.'® constructed a new cost-sensitive loss 
function that incorporates the cost difference caused by misclassification between different
classes, demonstrating its ability to enhance the model’s effectiveness.


2.2. Cost-Sensitive Learning Applied to Medical Diagnosis 


Research studies applying ML to medical diagnosis typically employ traditional ML
algorithms and advanced algorithms via ensemble learning,!® evolutionary algorithms,” sparse 
autoencoders (SAE).?! However, few research works have conducted cost-sensitive learning in 
medical diagnosis. Recently, Manop® developed a cost-sensitive XGBoost model for breast can- 
cer detection and evaluated it on four breast cancer datasets with uneven class distribution, 
achieving accuracy ranging from 95.99% to 96.43%. Ali et al. developed a method that com- 
bines cost-sensitive learning and ensemble learning techniques to predict breast cancer. The 
ensemble learning methods include GentleBoost, Bagging, and Adaptive Boosting, resulting in 
a 3.91% improvement. Zieba et al.” proposed the combination of ensemble learning and cost- 
sensitive Support Vector Machine (SVM) to address the lung cancer patients’ post-operation 
life expectancy. They observed that patients not covered by the minority rules have a 97% 
chance of surviving the considered survival period. Ali et al.?? applied cost-sensitive ensem- 
ble methods in the classification of chronic kidney disease (CKD) which incorporates feature 
ranking capabilities instead of enhancing predictive accuracy. Cost-sensitive deep neural net- 
works (CSDNN) have also been developed by Wang et al.?3 to predict hospital readmission, 
achieving a significant improvement in accuracy of 6% and 4% for 1-year and 30-day read- 
mission prediction, respectively. Mienye and Sunt? developed robust cost-sensitive classifiers 
for predicting medical diagnosis by modifying the objective functions of algorithms such as
logistic regression, decision trees, extreme gradient boosting, and random forest. They tested 
these classifiers on four medical datasets and demonstrated that cost-sensitive methods yield 
improvements ranging from 1% to 4% compared to the standard algorithms. 

These works® *:!9:3 suggest that incorporating cost-sensitive learning with specific cost val- 
ues during misclassifications into ML models can improve overall classification performance. 
However, these studies focus solely on improving overall classification results without deter- 
mining the impact of different cost values on sensitivity and specificity metrics. Our study 
expands the integration of the misclassification costs into the learning algorithm by optimizing 
the diagnostic interpretation sensitivity without sacrificing the overall diagnostic performance. 


3. Methodology 
3.1. Convolutional Neural Networks 


A classical CNN model (Figure 1) comprises convolutional layers and fully connected layers. 
The convolutional layers are designed to extract features, while the fully connected layers 
are responsible for classification. For our study, we focus on the binary classification problem 
distinguishing between malignant and benign cases. To achieve this, we employed a single 
neuron in the output layer with the sigmoid activation function. The output of the sigmoid 
activation function determines the predicted label. 


[Figure 1: input image → convolution layer → pooling layer → flattening → sigmoid activation function → predicted probability and predicted class]

Fig. 1: CNN architecture overview.


Let $O^{(t)}_{r_{cur},i}$ be the output of the current layer’s neuron with index $r_{cur}$ at epoch $t$ for any
image $X_i$:

$$O^{(t)}_{r_{cur},i} = \sum_{r_{pre}=1}^{d_{pre}} w^{(t)}_{r_{pre},r_{cur}} \, g\left(O^{(t)}_{r_{pre},i}\right) \qquad (1)$$

where $w^{(t)}_{r_{pre},r_{cur}}$ represents the weights between the neuron indexed as $r_{pre}$ in the previous
layer and the neuron indexed as $r_{cur}$ in the current layer at epoch $t$. $d_{pre}$ represents the number
of neurons in the previous layer, and $g$ represents the activation function applied to each output
value of a neuron $O^{(t)}_{r_{pre},i}$ from the previous layer $r_{pre}$.

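Given an image $X_i$, where $i$ represents the image index, we denote $y_i$ as its actual label
and $\hat{p}_i$ as its prediction probability:

$$\hat{p}^{(t)}_i = \sigma\left(O^{(t)}_{r_{cur},i}\right) = \frac{1}{1 + e^{-O^{(t)}_{r_{cur},i}}} \qquad (2)$$

where $\sigma$ represents the sigmoid activation function applied to the output layer neuron. The
predicted label $\hat{y}_i$ is calculated as follows:

$$\hat{y}_i = \begin{cases} 1, & \text{if } \sigma\left(O^{(t)}_{r_{cur},i}\right) \geq 0.5 \\ 0, & \text{if } \sigma\left(O^{(t)}_{r_{cur},i}\right) < 0.5 \end{cases} \qquad (3)$$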


Using the actual label $y_i$ and its predicted probability $\hat{p}_i$, the loss for instance $X_i$ at epoch
$t$ is calculated using Binary Cross Entropy (BCE) loss:

$$\mathrm{Loss}^{(t)}_{BCE,i}(\hat{p}_i) = -\left[y_i \log_2(\hat{p}_i) + (1 - y_i) \log_2(1 - \hat{p}_i)\right] \qquad (4)$$

During backpropagation, the model updates its weights to minimize the overall loss
$L^{(t)}_{ERM}$ (Equation 5) at epoch $t$, which is the average loss across all $N$ training instances:

$$L^{(t)}_{ERM} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{Loss}^{(t)}_{BCE,i}(\hat{p}_i) \qquad (5)$$

where ERM denotes Empirical Risk Minimization.
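
The forward computation in Equations (2) to (5) can be illustrated with a minimal sketch. The function names, the `eps` clipping constant, and the toy labels and outputs are assumptions made for illustration; base-2 logarithms and the 0.5 threshold follow our reading of Equations (3) and (4) and Section 4.3.

```python
import math

def sigmoid(o):
    """Eq. (2): map the output-neuron value O to a probability."""
    return 1.0 / (1.0 + math.exp(-o))

def predict_label(p_hat, threshold=0.5):
    """Eq. (3): predict malignant (1) when the probability reaches the threshold."""
    return 1 if p_hat >= threshold else 0

def bce_loss(y, p_hat, eps=1e-12):
    """Eq. (4): binary cross entropy with base-2 logs; eps (assumed) guards log(0)."""
    p_hat = min(max(p_hat, eps), 1.0 - eps)
    return -(y * math.log2(p_hat) + (1 - y) * math.log2(1.0 - p_hat))

def erm_loss(labels, outputs):
    """Eq. (5): average BCE loss over all N training instances."""
    losses = [bce_loss(y, sigmoid(o)) for y, o in zip(labels, outputs)]
    return sum(losses) / len(losses)

# Hypothetical toy batch: ground-truth labels and raw output-neuron values.
labels = [1, 0, 1]
outputs = [2.0, -1.0, -0.5]
predictions = [predict_label(sigmoid(o)) for o in outputs]
print(predictions, erm_loss(labels, outputs))
```

The third instance is a false negative (label 1, negative output), and it contributes the largest term to the averaged loss.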


3.2. Cost-aware CNN model 


To direct the performance of the CNN model towards the true positives (malignant cases
denoted by 1), we propose a cost-sensitive activation function $\sigma(\cdot,\, c(y,\hat{y})^{(t)})$ that penalizes
false negatives more heavily than false positives by assigning higher costs to outcomes that are
misclassifications of true positives and lower costs for misclassifying true negatives (benign
cases denoted by 0). Grounded in the work by Li et al.,!8 we define an inverse relationship 
between the cost of false negatives c(1,0) and the cost of false positives c(0,1) and restrict the 
values of c(1,0) to be greater than 1: 

$$c(0,1) = \frac{1}{c(1,0)} \qquad (6)$$

The activation function remains the same for correctly classified cases, and therefore, the 
costs for true positives c(1,1) are equal to the costs for true negatives c(0,0) and are equal to 
1. 

By integrating the cost values into the activation function, Equation (2) is transformed
into Equation (7), denoting that the predicted probabilities $\hat{p}^{(t)}_i$ at epoch $t$ are now influenced
by the costs associated with each type of output:

$$\hat{p}^{(t)}_i = \sigma\left(O^{(t)}_{r_{cur},i},\, c(y,\hat{y})^{(t)}\right) = \frac{1}{1 + e^{-c(y,\hat{y})^{(t)}\, O^{(t)}_{r_{cur},i}}} \qquad (7)$$


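Implicitly, the new $L^{(t)}_{ERM,Cost}$ loss function at epoch $t$, which is based on the predicted
probabilities, will be dependent on the cost values:

$$L^{(t)}_{ERM,Cost} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{Loss}^{(t)}_{BCE,i}\left(\hat{p}^{(t)}_i\right) \qquad (8)$$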
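
A minimal sketch of how the cost values could enter the sigmoid is given below. It rests on two assumptions that the text does not spell out: the cost multiplies the output-neuron value inside the exponent, and the predicted label used to select the cost is taken from the sign of the unscaled output. The names `cost` and `cost_aware_sigmoid` and the default value `c_fn=3.0` are illustrative only.

```python
import math

def cost(y, y_hat, c_fn=3.0):
    """Cost c(y, y_hat): 1 for correct predictions, c_fn = c(1,0) for a
    false negative, and 1 / c_fn = c(0,1) for a false positive (Eq. 6)."""
    if y == y_hat:
        return 1.0
    if y == 1 and y_hat == 0:
        return c_fn
    return 1.0 / c_fn

def cost_aware_sigmoid(o, y, c_fn=3.0):
    """Sigmoid with the output O scaled by the cost (our reading of Eq. 7)."""
    # Assumption: y_hat comes from the unscaled output, since
    # O >= 0 is equivalent to sigmoid(O) >= 0.5.
    y_hat = 1 if o >= 0 else 0
    c = cost(y, y_hat, c_fn)
    return 1.0 / (1.0 + math.exp(-c * o))

# A malignant case (y = 1) with a negative output is a false negative:
# the scaled probability is pushed further from the true label.
print(cost_aware_sigmoid(-1.0, y=1), 1.0 / (1.0 + math.exp(1.0)))
```

Correctly classified instances receive a cost of 1, so their probabilities match the standard sigmoid.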
3.3. Cost Analysis w.r.t. Sensitivity and Specificity


Maximizing both sensitivity and specificity simultaneously is not possible as they are inversely 
related.?4?> However, introducing cost-values in the activation function of the outcome layer 
allows optimizing the performance of each without a decline in the overall accuracy. Here we 
illustrate two examples that show the impact of the costs on the performance of a CNN model. 


3.3.1. Handling False Negative Cases 


In a false negative case, where the actual label $y = 1$ and the predicted label $\hat{y} = 0$, to achieve
$\hat{y} = 0$, $O^{(t)}_{r_{cur},i}$ must be a negative value. Using Equations (2) and (4), and if c(1,0) = 1, which
represents no cost value being used, we obtain the loss function:

$$\mathrm{Loss}^{(t)}_{BCE} = -\log_2 \frac{1}{1 + e^{-O^{(t)}_{r_{cur},i}}} \qquad (9)$$

If we introduce c(1,0) > 1 in the activation function, which is a cost value associated with
the false negative situation, we obtain the modified loss function:

$$\mathrm{Loss}^{(t)}_{BCE,Cost} = -\log_2 \frac{1}{1 + e^{-c(1,0)\, O^{(t)}_{r_{cur},i}}} \qquad (10)$$

$$\mathrm{Loss}^{(t)}_{BCE,Cost} > \mathrm{Loss}^{(t)}_{BCE} \qquad (11)$$

After introducing a cost value of c(1,0), the loss value obtained from Equation (10)
increases, as shown in Equation (11): because $O^{(t)}_{r_{cur},i}$ is negative and c(1,0) > 1, the scaled
output is more negative and the predicted probability is smaller. This cost value penalizes false
negative cases more heavily during the training process, leading to a decrease in false negative
cases and an increase in sensitivity.


3.3.2. Handling False Positive Cases 


In a false positive case, where $y = 0$ and $\hat{y} = 1$, to achieve $\hat{y} = 1$, $O^{(t)}_{r_{cur},i}$ must be a positive value.
If c(0,1) = 1, which means no cost value is used, we obtain the loss function:

$$\mathrm{Loss}^{(t)}_{BCE} = -\log_2\left(1 - \frac{1}{1 + e^{-O^{(t)}_{r_{cur},i}}}\right) \qquad (12)$$

If we introduce $c(0,1) = \frac{1}{c(1,0)}$ with c(1,0) > 1, which represents a cost value associated
with the false positive situation, we obtain the modified loss function:

$$\mathrm{Loss}^{(t)}_{BCE,Cost} = -\log_2\left(1 - \frac{1}{1 + e^{-c(0,1)\, O^{(t)}_{r_{cur},i}}}\right) \qquad (13)$$

$$\mathrm{Loss}^{(t)}_{BCE,Cost} < \mathrm{Loss}^{(t)}_{BCE} \qquad (14)$$


After introducing a cost value of c(0,1), the loss value obtained from Equation (13)
decreases, as shown in Equation (14). False positives are therefore penalized less during the
training process, so the model tolerates more false positive cases, which reduces specificity.

By incorporating cost-sensitive learning for both “false negative” and “false positive” cases,
the model can effectively update its weights to improve performance based on the specific
misclassifications encountered during training.
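
The two inequalities in Equations (11) and (14) can be checked numerically. The sketch below assumes that the cost scales the output-neuron value inside the sigmoid; the output values of -1 and 1 and the cost c(1,0) = 3 are hypothetical choices for illustration.

```python
import math

def loss_false_negative(o, c=1.0):
    """Loss for y = 1: -log2(sigmoid(c * O)), as in Eqs. (9) and (10)."""
    return -math.log2(1.0 / (1.0 + math.exp(-c * o)))

def loss_false_positive(o, c=1.0):
    """Loss for y = 0: -log2(1 - sigmoid(c * O)), as in Eqs. (12) and (13)."""
    return -math.log2(1.0 - 1.0 / (1.0 + math.exp(-c * o)))

c_fn = 3.0         # hypothetical c(1,0) > 1
c_fp = 1.0 / c_fn  # c(0,1) from Eq. (6)

# False negative: the output is negative, and the cost raises the loss (Eq. 11).
fn_plain, fn_cost = loss_false_negative(-1.0), loss_false_negative(-1.0, c_fn)
# False positive: the output is positive, and the cost lowers the loss (Eq. 14).
fp_plain, fp_cost = loss_false_positive(1.0), loss_false_positive(1.0, c_fp)

print(fn_plain, fn_cost)
print(fp_plain, fp_cost)
```

With these values, the false negative loss more than doubles while the false positive loss shrinks, which is the asymmetry the cost values are meant to introduce.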
4. Applications


We apply the cost-sensitive activation function approach to deep learning CAD models for 
lung cancer and breast cancer. 

For the lung cancer application, we use the NIH/NCI LIDC”® data and for the breast cancer 
application, we use the BreakHis.?” For both applications, the data was split into training, 
validation and testing sets using stratified random sampling, with proportions of 70%, 10%, 
and 20%, respectively. To ensure more robust results, we repeated the process of data splitting
and model development 30 times. The classification performance on the testing set is
reported as the mean across all 30 trials with a 95% confidence interval.
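
The evaluation protocol above can be sketched as follows. This is an illustrative sketch, not the authors' pipeline: the helper names are hypothetical, the split is done per class with the standard library, and the confidence interval uses a normal approximation, which is an assumption because the text does not state how the intervals were computed.

```python
import math
import random
import statistics

def stratified_split(labels, fractions=(0.7, 0.1, 0.2), seed=0):
    """Split indices into train/val/test while preserving class proportions."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_train = round(fractions[0] * len(idx))
        n_val = round(fractions[1] * len(idx))
        train += idx[:n_train]
        val += idx[n_train:n_train + n_val]
        test += idx[n_train + n_val:]  # the remainder forms the test set
    return train, val, test

def mean_ci95(values):
    """Mean and normal-approximation 95% confidence interval (assumed method)."""
    m = statistics.mean(values)
    half = 1.96 * statistics.stdev(values) / math.sqrt(len(values))
    return m, (m - half, m + half)

# Class counts taken from the LIDC description: 699 malignant, 906 benign.
labels = [1] * 699 + [0] * 906
splits = [stratified_split(labels, seed=trial) for trial in range(30)]
print(len(splits[0][0]), len(splits[0][1]), len(splits[0][2]))
```

In a full experiment, a model would be trained once per trial and `mean_ci95` applied to the 30 resulting test-set metrics.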


Fig. 2: Multiple visual appearances of cropped lung nodules: The two nodules on the left 
exhibit malignant features, characterized by spiculated contours and larger size, while the two 
on the right are benign, displaying smaller, smoother nodules indicative of non-malignancy. 


4.1. LIDC Dataset 


The LIDC”® dataset contains 2,680 distinct nodules found in Computed Tomography (CT) 
scans from 1,010 patients. In this study, we implemented the following data preprocessing
steps. First, we cropped nodules into images of size 71 x 71 (Figure 2), which is the size of
the largest nodule in the dataset. Second, we assigned malignancy classification labels, where
nodules with malignancy ratings of 1 (highly unlikely) and 2 (moderately unlikely) were
labeled as ‘Benign’, while nodules with malignancy ratings of 4 (moderately suspicious) and
5 (highly suspicious) were labeled as ‘Malignant’. After data pre-processing, which includes
normalization and removal of indeterminate nodules with malignancy rating 3, we were left
with 1,605 nodules. This dataset is imbalanced, comprising 699 malignant and 906 benign
nodules.
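
The labeling rule described above can be written as a short sketch. The function name and the use of `None` for dropped nodules are illustrative choices, and the sketch assumes a single malignancy rating per nodule; how ratings from several readers are combined is not stated here.

```python
def malignancy_label(rating):
    """Map a 1-5 malignancy rating to 0 (benign), 1 (malignant),
    or None for indeterminate nodules, which are removed."""
    if rating in (1, 2):
        return 0
    if rating in (4, 5):
        return 1
    if rating == 3:
        return None
    raise ValueError(f"unexpected malignancy rating: {rating}")

# Hypothetical ratings for six nodules.
ratings = [1, 3, 5, 2, 4, 3]
kept = [malignancy_label(r) for r in ratings if malignancy_label(r) is not None]
print(kept)
```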


4.2. BreakHis Dataset 


BreakHis?’ comprises of 7,909 histological images of breast tumor tissue collected from 82 pa- 
tients using varying magnification levels (40X, 100X, 200X, and 400X). It contains imbalanced 
data with 2,480 benign and 5,429 malignant images. The images are 3-channel RGB with 8-bit
depth per channel and dimensions of 700 x 460 pixels. In this study, we normalized images
using min-max normalization. A few sample images are shown in Figure 3.
Fig. 3: Breast cancer histological images from the BreakHis Dataset. The left two images
depict malignancy, while the right two show benign samples, illustrating critical distinctions 
for diagnosis. 
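
Min-max normalization, as applied to the BreakHis images, rescales pixel intensities to the range [0, 1]. The sketch below operates on a flat list of pixel values and normalizes per image; whether the minimum and maximum are taken per image or over the whole dataset is not stated in the text, so the per-image choice is an assumption.

```python
def min_max_normalize(pixels):
    """Rescale values to [0, 1] using the minimum and maximum of the input."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        # A constant image has no dynamic range; map it to zeros.
        return [0.0 for _ in pixels]
    return [(p - lo) / (hi - lo) for p in pixels]

# Hypothetical 8-bit intensities from one image channel.
print(min_max_normalize([0, 128, 255]))
```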


4.3. Design and Architecture of the Deep CNN Model 


The transfer learning method overcomes the limitation of having a small amount of training 
data by initially pre-training a deep learning model on a publicly available large dataset. As 
part of our study’s cost-sensitive algorithm classification model, we fine-tune”® a pre-trained 
ResNet18 convolutional neural network from ImageNet?® on our own dataset. 

For this study, we followed Nibali et al.’s recommendation” and utilized the ResNet18 CNN 
architecture.*! In the intermediate layers of the architecture, the Rectified Linear Unit (ReLU) 
activation function is used. Given the objective of addressing a binary classification problem 
(malignant vs. benign), we implemented a single neuron with the sigmoid activation function 
as the output layer. Consequently, during training, if the output of the sigmoid activation 
function is greater than or equal to 0.5, we classify the instance as Malignant (positive class); 
otherwise, we classify it as Benign (negative class). 


4.4. Experimental Results 


The accuracy, sensitivity, and specificity assessment of the two datasets across 30 trials are 
tabulated in Table 1 through Table 4, showing the performance when the classifier is trained 
with different misclassification costs on the LIDC and BreakHis datasets. The first column and 
second column indicate the cost value given for false positives and false negatives, while the 
various metrics are listed from the third to the last columns. The numbers in bold indicate 
a significant difference compared to the results obtained without any cost values, whereas 
non-bold numbers indicate no significant difference. 
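
For reference, the three reported metrics follow from the confusion-matrix counts, with malignant as the positive class. The function name and the toy labels below are illustrative.

```python
def diagnostic_metrics(y_true, y_pred):
    """Accuracy, sensitivity (true positive rate), and specificity
    (true negative rate) for binary labels with 1 = malignant."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical predictions: one false negative and one false positive.
result = diagnostic_metrics([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
print(result)
```

Reducing false negatives raises sensitivity, while additional false positives lower specificity, which is the trade-off examined in the tables.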

From the experimental results, the cost-sensitive classifier indeed supports the analysis 
in Section 3.3. Table 1 reveals that we can enhance sensitivity while maintaining the same 
level of accuracy. Notably, in both datasets, sensitivity increases with a decrease in speci- 
ficity for higher values of c(1,0), which is associated with false negatives, and lower values of 
c(0,1), which is associated with false positives. For the LIDC dataset, the highest sensitivity 
was achieved when c(0,1) = 0.33 and c(1,0) = 3, with accuracy and specificity remaining 
unaffected. We observed a significant sensitivity improvement of 3.84 percentage points
(76.22% to 80.06%) compared to the baseline model, where c(0,1) = c(1,0) = 1. Table 2
presents the classification performance on the BreakHis dataset, showing a significant
sensitivity improvement of 5.4 percentage points (90.38% to 95.78%) when c(0,1) = 0.067
and c(1,0) = 15. Additionally, we
observe a trend of significantly increasing sensitivity with decreasing specificity while keeping
the overall accuracy within the range of 86.55% to 88.10%. In Tables 1 to 4, numbers within
parentheses represent 95% confidence intervals, with bold numbers indicating significant
distinctions compared to the baseline model (c(0,1) = c(1,0)).

Table 1: The classification performance on the LIDC testing data was assessed through 30
trials with increasing c(1,0) cost values.

c(0,1)   c(1,0)   Accuracy             Sensitivity          Specificity
1        1        85.26%               76.22%               91.25%
                  (84.63%, 85.89%)     (74.62%, 77.83%)     (90.33%, 92.17%)
0.5      2        85.25%               76.89%               90.79%
                  (84.39%, 86.11%)     (75.17%, 78.61%)     (89.91%, 91.67%)
0.33     3        85.80%               80.06%               89.61%
                  (85.19%, 86.41%)     (78.81%, 81.30%)     (88.54%, 90.69%)
0.2      5        84.93%               78.22%               89.37%
                  (84.37%, 85.48%)     (76.76%, 79.69%)     (88.39%, 90.35%)
0.1      10       84.81%               78.97%               88.67%
                  (84.06%, 85.55%)     (77.33%, 80.62%)     (87.31%, 90.04%)
0.067    15       84.56%               78.58%               88.53%
                  (83.90%, 85.23%)     (76.83%, 80.33%)     (87.23%, 89.83%)

Table 2: The classification performance on the BreakHis testing data was assessed through 30
trials with increasing c(1,0) cost values.

c(0,1)   c(1,0)   Accuracy             Sensitivity          Specificity
1        1        88.10%               90.38%               83.10%
                  (88.03%, 88.16%)     (90.29%, 90.47%)     (82.98%, 83.21%)
0.5      2        88.38%               91.84%               80.81%
                  (88.32%, 88.45%)     (91.75%, 91.93%)     (80.63%, 80.99%)
0.33     3        88.39%               93.79%               76.58%
                  (88.33%, 88.46%)     (93.69%, 93.88%)     (76.40%, 76.76%)
0.2      5        87.82%               95.05%               72.00%
                  (87.74%, 87.91%)     (94.92%, 95.18%)     (71.76%, 72.23%)
0.1      10       87.15%               95.71%               68.42%
                  (87.04%, 87.26%)     (95.58%, 95.84%)     (68.00%, 68.83%)
0.067    15       86.55%               95.78%               66.66%
                  (86.51%, 86.79%)     (95.64%, 95.92%)     (66.09%, 67.23%)

An increase in sensitivity is observed with decreasing c(0,1) values, which shows significant
improvement compared to the baseline models. Table 3 illustrates the results for
increasing c(0,1) values, which lead to a significant decrease in sensitivity and an increase
in specificity. We observe that with c(0,1) = 15 and c(1,0) = 0.067, the highest specificity is
achieved, albeit with a trade-off in sensitivity. When we examine the impact of increasing
c(0,1) values on BreakHis, as illustrated in Table 4, we observe the same trade-off, with
increasing specificity and decreasing sensitivity. This leads to a notable 8.4% improvement
in specificity (83.10% to 91.50%); however, it comes at the expense of a reduction in sensitivity.

Table 3: The classification performance on the LIDC testing data was assessed through 30
trials with increasing c(0,1) cost values.

c(0,1)   c(1,0)   Accuracy             Sensitivity          Specificity
1        1        85.26%               76.22%               91.25%
                  (84.63%, 85.89%)     (74.62%, 77.83%)     (90.33%, 92.17%)
                  84.83%               75.06%               91.31%
                  (84.13%, 85.52%)     (73.24%, 76.88%)     (90.24%, 92.38%)
                  84.39%               73.25%               91.77%
                  (83.70%, 85.07%)     (71.44%, 75.06%)     (90.66%, 92.88%)
                  83.00%               69.56%               92.41%
                  (82.53%, 84.08%)     (67.24%, 71.88%)     (91.17%, 93.65%)
                  81.93%               64.11%               93.74%
                  (81.03%, 82.82%)     (61.89%, 66.33%)     (92.70%, 94.78%)
15       0.067    80.48%               60.69%               93.59%
                  (79.38%, 81.58%)     (57.28%, 64.11%)     (92.36%, 94.83%)

Table 4: The classification performance on the BreakHis testing data was assessed through 30
trials with increasing c(0,1) cost values.

Accuracy             Sensitivity          Specificity
88.10%               90.38%               83.10%
(88.03%, 88.16%)     (90.29%, 90.47%)     (82.98%, 83.21%)
87.43%               88.75%               84.52%
(87.35%, 87.50%)     (88.64%, 88.87%)     (84.33%, 84.71%)
84.70%               84.43%               85.27%
(80.93%, 88.46%)     (78.48%, 90.39%)     (84.22%, 86.32%)
81.34%               79.02%               86.41%
(76.26%, 86.41%)     (70.99%, 87.05%)     (85.02%, 87.80%)
70.62%               61.39%               90.85%
(63.16%, 78.10%)     (49.71%, 73.08%)     (89.06%, 92.65%)
69.41%               59.33%               91.50%
(62.17%, 76.66%)     (48.02%, 70.63%)     (89.76%, 93.20%)
For both the LIDC and BreakHis datasets, we observe an increasing trend of sensitivity
with increasing c(1,0) values, and an increasing trend of specificity with increasing c(0,1) 
values. Notably, in both datasets, sensitivity increases with the decrease in specificity for 
higher c(1,0) values and lower c(0,1) values. 

Our results indicate that by tuning the cost values, we can achieve higher sensitivity or 
specificity. Significantly, the overall accuracy remains consistent in the majority of cases. 


5. Conclusion and Future Work 


In this study, we proposed the incorporation of cost-sensitive values into the activation
function for deep learning-based CAD systems. Specifically, our approach addresses one of the
most common problems in CAD, which is improving true positives measured using sensitivity,
by adjusting the cost values without significantly impacting accuracy. The effectiveness and
robustness of the model are demonstrated through theoretical analysis and experiments on
different datasets. Compared with previous work on LIDC and BreakHis, this is the first study
that utilizes cost values in activation functions to enhance sensitivity. Our findings strongly
suggest that incorporating cost values as hyperparameters in future CAD systems holds
promising benefits, with statistically significant increases in sensitivity of 3.84% and 5.4% on
the LIDC and BreakHis datasets, respectively, while maintaining overall accuracy.

While our study has yielded valuable insights, it is limited to the two datasets used in this
research, which may affect the generalizability of our findings. Furthermore, our study
relied on a single activation function and on binary cross entropy loss.

Future investigations can involve the inclusion of datasets from Medical Imaging and Data 
Resource Center? (MIDRC), The Cancer Imaging Archive*? (TCIA), and other medical- 
related databases, which can provide a broader perspective on model performance. To im- 
prove model performance and address imbalanced datasets, we can explore various activation 
functions and loss functions, such as focal loss.*4 In addition to this, we will also explore 
different thresholds for classifying instances as malignant or benign based on the sigmoid ac- 
tivation function output. Currently, we use a default threshold of 0.5. We will also investigate 
the integration of the proposed cost values with the group Distributionally Robust Optimiza- 
tion? (gDRO) algorithm. This approach aims to enhance the worst group performance while 
preserving the overall CAD system’s effectiveness. Additionally, we plan to delve into the 
impact of cost-sensitive activation functions on multi-class classification. These endeavors col- 
lectively aim to improve the accuracy and effectiveness of the cost-sensitive learning approach 
in medical diagnosis, ultimately benefiting diagnostic decision-making and patient outcomes. 
VetLLM: Large Language Model for Predicting Diagnosis from Veterinary Notes


Yixing Jiang, Jeremy A. Irvin, Andrew Y. Ng and James Zou†
Stanford University, Stanford, CA, United States
†E-mail: jamesz@stanford.edu


Lack of diagnosis coding is a barrier to leveraging veterinary notes for medical and public
health research. Previous work has been limited to developing specialized rule-based or
customized supervised learning models to predict diagnosis coding, which is tedious and not
easily transferable. In this work, we show that open-source large language models (LLMs)
pretrained on a general corpus can achieve reasonable performance in a zero-shot setting.
Alpaca-7B can achieve a zero-shot F1 of 0.538 on CSU test data and 0.389 on PP test data, two
standard benchmarks for coding from veterinary notes. Furthermore, with appropriate
fine-tuning, the performance of LLMs can be substantially boosted, exceeding that of strong
state-of-the-art supervised models. VetLLM, which is fine-tuned from Alpaca-7B using just 5,000
veterinary notes, can achieve an F1 of 0.747 on CSU test data and 0.637 on PP test data. It
is of note that our fine-tuning is data-efficient: using 200 notes can outperform supervised
models trained with more than 100,000 notes. The findings demonstrate the great potential
of leveraging LLMs for language processing tasks in medicine, and we advocate this new
paradigm for processing clinical text.


Keywords: Diagnosis Extraction, Veterinary Notes, Veterinary Medicine, Large Language 
Models, LLM, Foundation Models. 


1. Introduction 


Most veterinary records are free text without structured diagnostic codes, making
them difficult to use for medical research, public health monitoring or quality-improvement
programs.! For example, the eligibility criteria for many clinical trials include diagnosis history.
It is challenging to accurately identify certain cohorts which meet specific diagnostic criteria 
for translational research without structured diagnosis codes for each individual animal. A 
small number of large veterinary centers hire dedicated coding staff to manually apply disease 
codes to clinical records, which is labor-intensive, while most veterinary clinics do not code 
the notes.' One potential solution that previous works have explored is to develop systems 
which automatically code veterinary notes. However, these approaches have been limited to 
specialized rule-based or machine learning-based models, which are tedious to design and do 
not easily generalize well to new formats of reports. 

Large language models (LLMs) have the potential to serve as an effective method for 
veterinary information extraction. There is rising interest in studying large language models, 
commonly referred to as types of “foundation models” (models which can be adapted to 
many different tasks). LLMs have a large number of parameters and are typically pre-trained 
on a large text corpus. They have shown promising performances on many NLP tasks, even
in zero-shot and few-shot settings.?> However, there is no study on how well those LLMs 
perform on analyzing veterinary notes. Moreover, veterinary notes differ in style from
general text available on the Internet. For example, the vocabulary used is different and
many acronyms are included. Given that the pre-training corpus for most LLMs was sourced
from the Internet, veterinary notes are good examples for evaluating the performance of LLMs
on atypical text.
Our contributions can be summarised as follows: 


(1) We develop VetLLM to extract diagnostic information from veterinary notes and inves- 
tigate its performance on dataset portions from two veterinary practices. Specifically we 
assess performance of models trained without any fine-tuning and with fine-tuning. 

(2) We empirically show that LLMs can achieve promising performance on the task of diagnosis
extraction from veterinary notes. Base LLMs without fine-tuning can achieve reasonable
performance under zero-shot settings. For example, Alpaca-7B can achieve a zero-shot
F1 of 0.538 when evaluated on CSU test data.

(3) Fine-tuned VetLLM performs better by a large margin compared with strong state-of-the-art
supervised models. When evaluated on external test data, VetLLM outperforms the
VetTag model by 21% and 8% in F1 score and exact match score, respectively.

(4) We find that fine-tuning an LLM for the diagnosis extraction task is data-efficient. More
specifically, using 200 notes can outperform supervised models trained with more than
100,000 notes in terms of F1 score.

(5) We detail a new paradigm for processing medical text in section 5.3, and the findings 
show the superiority of this new paradigm which leverages LLMs. Code will be available 
at https://github.com/stanfordmlgroup/VetLLM. 


2. Related Work 


Many previous studies have investigated automatic information extraction from clinical notes,
including MetaMap,? statistical modeling,’ text CNN,® and long-short-term memory network 
(LSTM).’ More specifically, DeepTag! and VetTag® are some previous work on this dataset. 
DeepTag extended a bidirectional LSTM architecture with a hierarchical loss function and 
achieved better performances.! VetTag further leveraged transformers architecture? and con- 
ducted large-scale pre-training on veterinary text, leading to the current state-of-the-art per- 
formances on this dataset.® 

There have been some recent studies showing that many of those LLMs are “generalist” in the
sense that they can perform reasonably well on a large variety of tasks across domains.?? 
In the medical domain, LLM have shown promising performances for many tasks, including 
information extraction,!? medical Q&A,*!!!? generating USMLE-style questions!? and radi- 
ology reports.'4 There are also some commentaries on the potential and regulation of LLM 
for medical use cases. 15-718 
3. Methods


The task is to extract diagnoses from free-text veterinary notes. The extraction
task can be formulated as a multi-class multi-label classification problem. Specifically,
for each disease, the model should output whether there is a positive mention in the veterinary
clinical note.

The development pipeline included data cleaning, model selection, prompt design, resolver 
design, model finetuning, and system evaluation. 


3.1. Data 


The DeepTag! dataset was used for the project. It contains over 100K expert labeled veterinary 
notes from Colorado State University (CSU) and a private practice clinic (PP). Both the CSU
portion and the PP portion used here were previously used for VetTag, and it is noteworthy that
VetTag was also pre-trained on another much larger dataset. In this project, we selected the nine
most prevalent diseases for analysis due to computational constraints. These nine diseases
covered at least one diagnosis in around 90% of cases in both the CSU and PP portions, and they
covered around 60% to 70% of all top-level disease labels. After manual review, we removed
incomplete reports shorter than 200 characters to ensure data quality.

The CSU portion contains 112,557 veterinary notes from the Colorado State University 
College of Veterinary Medicine and Biomedical Sciences. Each note was labeled with a set of 
SNOMED-CT codes by veterinarians at Colorado State. Colorado State is a tertiary referral 
center with an active and nationally recognized cancer center. We kept the same train/val/test
split as VetTag for fair comparison.

The PP portion contains 586 discharge summaries curated from a commercial veterinary 
practice located in Northern California, and six notes were removed due to incompleteness. 
Two veterinary experts applied SNOMED-CT codes to these records. Records with coding 
discrepancies were reviewed by both coders to reach a consensus on each record. This dataset 
is drastically different from the CSU dataset. PP notes are often written in an informal style,
evidenced by their shorter length and usage of abbreviations. The PP data also has a different
diagnosis distribution compared to CSU, a specialized academic cancer center. It is of note
that all notes in the PP portion are used for testing, serving as an external validation dataset.

Table 1 shows the details of the dataset. Here is one example of a veterinary note from the
PP portion together with its labels:

cried at home 8 body condition not drinking excess
not urinating more frequently appetite is normal energy level is good skin is normal heart
auscultates normal abnormal findings pain over l-s pain pulling hips back x rays ventral dorsal
hips cauda equina looked perfect

Expert annotated diseases: ‘Hypersensitivity condition’, ‘Propensity to adverse reactions’

Given both of these datasets are private and the data usage agreement prohibits data 
sharing with third parties, only models hosted locally can be used to analyze the data. 


3.2. Models 


Alpaca-7B and VetLLM Alpaca-7B’ was used as the base LLM model as it has been 
instruction fine-tuned and is publicly available. Furthermore, a subset of CSU training split 

Table 1. Descriptive statistics of the DeepTag dataset

                     CSU       PP
# of notes           112,557   580
Size of test split   5,483     580
Avg # of words       368       253

was used to further fine-tune Alpaca-7B using low-rank adaption,” leading to VetLLM. The 
details of fine-tuning are discussed in Section 3.4. The temperature for both Alpaca-7B and
VetLLM was set to zero to allow reproducibility.

VetTag The supervised baseline model was the one developed in the VetTag paper,
achieving state-of-the-art performances on the dataset. It was pre-trained on a large corpus
of unlabelled veterinary notes (917,665 notes) using causal language modeling, and then
fine-tuned using the training split (101,301 notes) of the CSU portion. The prediction logits from
VetTag were obtained from the VetTag team, and the logits corresponding to the nine diseases
were extracted to calculate the metrics.

KeywordMatch Another baseline used keyword matching. The synonyms
of the diseases were retrieved using WordNet, and fuzzy matching with the partial ratio metric
was used. The model returned a positive prediction if the partial ratio between the veterinary
note and the disease names was above 80%.
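
The keyword baseline can be illustrated with a minimal sketch. It assumes a simplified partial ratio built on Python's standard `difflib` rather than the fuzzy-matching library used in the study, and it omits the WordNet synonym lookup. The function names and the example terms are hypothetical.

```python
from difflib import SequenceMatcher


def partial_ratio(short: str, long: str) -> float:
    """Best similarity (0-100) between `short` and any same-length window of `long`.

    Simplified stand-in for the partial ratio metric of fuzzy-matching libraries.
    """
    short, long = short.lower(), long.lower()
    if len(short) > len(long):
        short, long = long, short
    if not short:
        return 0.0
    best = 0.0
    for start in range(len(long) - len(short) + 1):
        window = long[start:start + len(short)]
        best = max(best, SequenceMatcher(None, short, window).ratio() * 100)
    return best


def keyword_match(note: str, disease_terms: list[str], threshold: float = 80.0) -> bool:
    """Return True if any disease name or synonym fuzzily appears in the note."""
    return any(partial_ratio(term, note) > threshold for term in disease_terms)
```

Because the score is taken over the best-matching window, a misspelled term such as "dermatitus" still matches "dermatitis", while unrelated notes stay below the 80% threshold.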

In short, four models were compared: Alpaca-7B (LLM baseline), VetLLM (fine-tuned
LLM), VetTag (supervised baseline) and KeywordMatch.


3.3. Prompts and Resolvers 


The guiding principle for prompt design was to follow the format of the instruction-tuning set
and to be clear and specific. We tried only a small number of prompts on ten notes from
the CSU validation split, and the main criterion was whether the output was easily resolvable.
Figure 1 shows the prompt used, along with one example input and output from VetLLM.
The prompt queried the LLM about one disease at a time rather than asking it to
list all diagnoses. This design choice greatly simplified the resolver design. Inference
was run on two A4000 GPUs.
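
The one-disease-per-query design can be sketched as follows. The template text follows Figure 1, but the exact whitespace between its parts is an assumption. `build_prompts` and `TICKS` are hypothetical names introduced for illustration.

```python
TICKS = "`" * 3  # triple-backtick delimiter around the note, as described in the prompt

# Approximate reconstruction of the Figure 1 template; blank-line placement is assumed.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n"
    "Answer the following yes/no question based on the veterinary note "
    "delimited by triple backticks:\n\n"
    "### Input:\n"
    "{ticks}{text}{ticks}\n\n"
    "Does this animal have {disease}?\n"
    "### Response:\n"
)


def build_prompts(note: str, diseases: list[str]) -> dict[str, str]:
    """Return one prompt per disease for a single veterinary note."""
    return {
        disease: PROMPT_TEMPLATE.format(ticks=TICKS, text=note, disease=disease)
        for disease in diseases
    }
```

With nine diseases, each note yields nine prompts, and each response needs only a yes/no interpretation.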

After a response was obtained from the LLM, a resolver converted the text response
into a structured prediction. The resolver used in this study was simple: it first converted
the decoded text response to lower case and stripped any leading whitespace. A positive
prediction was rendered if the resulting string started with "yes", and a negative prediction
was rendered if it started with "no". Otherwise, the case was rendered as unresolvable.
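
The resolver described above is small enough to sketch directly. The function name and the use of `None` for unresolvable cases are illustrative choices, not taken from the study's code.

```python
from typing import Optional


def resolve(response: str) -> Optional[bool]:
    """Map a decoded LLM response to True (yes), False (no) or None (unresolvable)."""
    text = response.lower().lstrip()  # lower-case, then drop leading whitespace
    if text.startswith("yes"):
        return True
    if text.startswith("no"):
        return False
    return None
```

Prefix matching keeps the resolver robust to end-of-sequence tokens such as `yes</s>`. It also means that any response beginning with "no", for example "not sure", resolves to a negative prediction.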


3.4. Finetuning 


A subset of 5,000 notes was randomly sampled from the CSU training split, and this subset
was used to fine-tune Alpaca-7B using low-rank adaptation (LoRA) on four A4000 GPUs.
We chose LoRA because it generally provides superior performance and induces no extra inference
overhead. The fine-tuning samples were generated using the same prompt template described

Prompt Template

    Below is an instruction that describes a task, paired with an input that provides further context.
    Write a response that appropriately completes the request.

    ### Instruction:
    Answer the following yes/no question based on the veterinary note delimited by triple backticks:

    ### Input:
    ```{text}```

    Does this animal have {disease}?
    ### Response:

Example Input to VetLLM

    Below is an instruction that describes a task, paired with an input that provides further context.
    Write a response that appropriately completes the request.

    ### Instruction:
    Answer the following yes/no question based on the veterinary note delimited by triple backticks:

    ### Input:
    ```w=12.9 lbs itching constantly flea allergy dermatitis 7 body condition not drinking excess not
    urinating more frequently appetite is normal energy level is good skin is normal heart auscultates
    normal abnormal findings lots of hair loss fleas found other cat treatment , prescription dex drops
    disc flea allergy dermatitis and not spraying yard roos Imor 8 / 28```

    Does this animal have Hypersensitivity condition?
    ### Response:

Example Output from VetLLM

    yes</s>

Fig. 1. Prompt Template with Example Input and Output from VetLLM


in the previous section. The hyper-parameters are listed in Appendix A. A subset of 200
notes was randomly sampled from the CSU validation split to form the validation set for
fine-tuning. An early stopping callback with a patience of five was added.

To study the data efficiency of fine-tuning, the 5,000-note subset was further sub-sampled
to 2,000, 1,000, 500 and 200 notes sequentially, so that each smaller subset is
strictly contained in the next larger one. Each of these subsets was used as a fine-tuning
set. In short, five fine-tuned Alpaca-7B models were trained.
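
The sequential sub-sampling can be sketched as below. The seed and the function name are hypothetical. The only property taken from the text is that each smaller subset is drawn from the next larger one.

```python
import random


def nested_subsets(note_ids, sizes=(5000, 2000, 1000, 500, 200), seed=0):
    """Draw nested random subsets: each smaller subset is sampled from the previous one."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the draw is repeatable
    pool = list(note_ids)
    subsets = {}
    for size in sorted(sizes, reverse=True):
        pool = rng.sample(pool, size)  # sample without replacement from the larger subset
        subsets[size] = pool
    return subsets
```

For example, `nested_subsets(range(101301))` draws from a pool the size of the CSU training split and returns five subsets, each contained in the next larger one.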


3.5. Evaluation Metrics 


As this is a multi-class multi-label classification problem, metrics were computed for both
the overall prediction and each individual class. More specifically, each model was evaluated based on
exact match (EM, the fraction of notes where the algorithm’s predicted diagnoses exactly
match the expert diagnoses), precision (the fraction of notes with positive predictions that
match the expert diagnoses), recall (the fraction of notes where the expert diagnoses are
successfully retrieved), and F1 (the harmonic mean of precision and recall). The last three
metrics were macro-averaged across classes to obtain the overall metrics. The standard deviations
of these metrics were calculated using bootstrapping with 1,000 re-samples.

Table 2. Quantitative evaluation on classification

CSU
Model                    Exact Match     Precision   Recall   F1
VetLLM                   53.5% ± 0.7%    0.726       0.774    0.747 ± 0.004
Alpaca-7B (zero shot)    34.0% ± 0.6%    0.604       0.527    0.538 ± 0.005
VetTag (supervised)      49.3% ± 0.7%    0.798       0.492    0.592 ± 0.006
KeywordMatch             29.1%           0.442       0.002    0.003

PP
Model                    Exact Match     Precision   Recall   F1
VetLLM                   38.0% ± 2.1%    0.661       0.630    0.637 ± 0.015
Alpaca-7B (zero shot)    22.0% ± 1.7%    0.485       0.375    0.389 ± 0.017
VetTag (supervised)      30.1% ± 1.9%    0.680       0.344    0.422 ± 0.018
KeywordMatch             24.9%           0.050       0.006    0.010
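
A minimal sketch of these metrics follows, assuming each note's labels are represented as a Python set and that macro F1 is the mean of per-class F1 scores. The function names, the zero-division convention and the seed are illustrative assumptions.

```python
import random
from statistics import mean, stdev


def exact_match(y_true, y_pred):
    """Fraction of notes whose predicted label set equals the expert label set."""
    return mean(1.0 if t == p else 0.0 for t, p in zip(y_true, y_pred))


def macro_prf(y_true, y_pred, classes):
    """Macro-averaged precision, recall and F1 over the given classes."""
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if c in t and c in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if c not in t and c in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if c in t and c not in p)
        prec = tp / (tp + fp) if tp + fp else 0.0  # convention: 0 when undefined
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    return mean(precisions), mean(recalls), mean(f1s)


def bootstrap_sd(y_true, y_pred, metric, n_resamples=1000, seed=0):
    """Standard deviation of `metric` over bootstrap re-samples of the notes."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample notes with replacement
        scores.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    return stdev(scores)
```

Re-sampling whole notes, rather than individual note-disease pairs, keeps the labels of a note together, which is what exact match requires.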


4. Results 
4.1. Overall Evaluation on Classification 


Table 2 shows the quantitative evaluation results averaged across classes. When evaluated on
the CSU portion, Alpaca-7B performs reasonably in a zero-shot manner, with a gap of only about
0.05 in F1 relative to the supervised baseline (0.538 vs. 0.592). VetLLM, which was fine-tuned
using 5,000 notes, improves greatly on Alpaca-7B, with an absolute gain of about 0.21 in F1
(0.538 to 0.747) and 19.5 percentage points in exact match score (34.0% to 53.5%).


4.2. Stratified Evaluation on Classification 


Figures 2 and 3 show the F1 metrics of three models evaluated on each class. They show that
the VetLLM model, fine-tuned from Alpaca-7B, outperforms the supervised VetTag model
in every class on both in-distribution data (CSU portion) and out-of-distribution data
(PP portion). They also show significant improvements in performance in most classes after
fine-tuning, with no degradation in any class.


4.3. Data-efficiency of Fine-tuning 


Figure 4 shows how performance improves as the number of fine-tuning samples increases.
It shows that fine-tuning on fewer than 200 notes is enough to exceed the performance of the
supervised model, demonstrating the data efficiency of fine-tuning LLMs. Note that the X-axis
represents the number of veterinary notes used, so the size of the fine-tuning set is nine times
the number of notes (one sample per disease for each note).


5. Discussion 


The results show the promise of VetLLM for the task of extracting diagnoses from veterinary
notes. More broadly, they demonstrate the great potential of leveraging LLMs for
processing medical text, which is detailed in Section 5.3.

Although the performance is promising, it is sensitive to prompt design, such as
problem formulation, order of information and presence of extra information. In some cases,
even adding leading spaces at the beginning of each line affects performance. There is
still no well-established, systematic way of assessing an LLM’s sensitivity to prompt
design, although the development of LLMs is likely to benefit from ongoing research on AI
alignment. Therefore, a more comprehensive evaluation must be conducted, or post-hoc quality
control measures must be taken, if this system is to be deployed.

[Figure 2: bar chart "Stratified F1 on CSU"; x-axis: Class; y-axis: F1 Score; series: Alpaca-7B, VetLLM, VetTag]

Fig. 2. Stratified F1 on CSU test data. Alpaca-7B is the base LLM model, and VetLLM is Alpaca-7B
fine-tuned with 5,000 notes. VetTag is the state-of-the-art supervised model.

Also, the evaluation in this paper is limited to datasets from two centers in the United 
States. Veterinary notes from other veterinary medicine centers are likely to have different 
distributions which might affect performances. 


5.1. Error Analysis 


To gauge the knowledge embedded in the LLM, the Alpaca-7B model was prompted to explain
the top fourteen diseases. The responses from Alpaca-7B are included in Appendix B and
were manually reviewed for relevance and factuality. The results indicate that Alpaca can provide
highly relevant and factually correct descriptions of diseases, hinting that the pre-training
corpus might contain high-quality medical text describing various diseases.

[Figure 3: bar chart "Stratified F1 on PP"; x-axis: Class; y-axis: F1 Score; series: Alpaca-7B, VetLLM, VetTag]

Fig. 3. Stratified F1 on PP test data. Alpaca-7B is the base LLM model, and VetLLM is Alpaca-7B
fine-tuned with 5,000 notes. VetTag is the state-of-the-art supervised model.


Furthermore, the correlation between note length and performance was analyzed using
the two portions, and the results are shown in Figures 5 and 6. The note lengths were binned
into five quantile groups. Based on the results, the exact match score is negatively correlated
with the note length, while the trends for F1 score seem inconsistent.
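
The length-stratified analysis can be sketched as follows, assuming notes are grouped by quintile cut points of note length and scored by exact match within each group. The function name and the handling of empty bins are assumptions.

```python
from bisect import bisect_left
from statistics import mean, quantiles


def exact_match_by_length(lengths, correct, n_bins=5):
    """Exact-match rate per note-length quantile bin (shortest notes first).

    `lengths[i]` is the word count of note i; `correct[i]` is True when the
    predicted label set of note i exactly matches the expert labels.
    """
    cuts = quantiles(lengths, n=n_bins)  # n_bins - 1 interior cut points
    bins = [[] for _ in range(n_bins)]
    for length, ok in zip(lengths, correct):
        bins[bisect_left(cuts, length)].append(1.0 if ok else 0.0)
    return [mean(b) if b else None for b in bins]
```

A declining sequence of values from the first bin to the last corresponds to the negative relationship between exact match and note length reported above.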


5.2. Computational Costs 


In the era of large models, the computational costs and environmental impact of model training
and inference have become a growing concern. All estimates in this section assume a setup
of four NVIDIA RTX A4000 GPUs (launched in April 2021), each with 16 GB of GPU
memory. VetLLM was fine-tuned from Alpaca-7B using 5,000 notes, and the fine-tuning took
around 48 hours with a micro batch size of one.

One limitation of VetLLM is that it requires multiple passes for multi-class classification, while
traditional supervised models can generate multi-class predictions in a single pass by using
multiple neurons in the last layer. A single inference pass for VetLLM takes around 0.3 seconds,
and model loading before first use takes around 15 seconds. VetLLM is therefore likely to
have slower inference than traditional supervised models. Given the significant
boost in performance and the fact that the application has no strong real-time requirement, we
think the increased inference time is reasonable. One mitigation, which we leave as future
work, is to use a multi-label approach, such as asking multiple questions in a single turn or
asking the model to select all diseases present in the veterinary note.

[Figure 4: line plot "Data efficiency of fine-tuning"; x-axis: Number of notes used for fine-tuning (0 to 5000); series: F1 on PP, VetTag on PP, F1 on CSU, VetTag on CSU]

Fig. 4. Data efficiency plot. The number of notes here refers to the number of notes in the CSU
training split used for fine-tuning Alpaca-7B. The PP portion was used only for testing. Note that
VetTag used over 100,000 notes for fine-tuning.
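
As a rough worked example of the inference cost, the sketch below multiplies the per-pass time by the number of disease queries per note. It assumes strictly sequential passes with no batching, which the paper does not state. The constant and function names are hypothetical.

```python
N_DISEASES = 9            # one yes/no query per disease and note
SECONDS_PER_PASS = 0.3    # approximate time of a single inference pass
MODEL_LOAD_SECONDS = 15   # one-off model loading time


def inference_seconds(n_notes: int, n_diseases: int = N_DISEASES) -> float:
    """Estimated wall-clock time to classify `n_notes`, assuming sequential passes."""
    return MODEL_LOAD_SECONDS + n_notes * n_diseases * SECONDS_PER_PASS


csu_test_hours = inference_seconds(5483) / 3600  # CSU test split size from Table 1
```

Under these assumptions a single note costs about 2.7 seconds, and the CSU test split of 5,483 notes takes roughly four hours.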


5.3. New Paradigm for Processing Medical Text 


In this paper, we demonstrate the potential of a new paradigm for processing medical text:
starting with a pre-trained large language model (LLM), then designing a prompt and resolver.
The resolver interprets the output from the LLM and transforms the raw output into struc- 
tured answers. After designing the prompt and resolver, the next step is to conduct a quick 
evaluation in a zero-shot or few-shot setting. If the performance is satisfactory, it is a good 
idea to proceed with more comprehensive evaluation and iteratively improve the prompt and 
resolver. If the performance is poor, it might be worth curating a small fine-tuning dataset and 
utilizing data-efficient techniques like LoRA to fine-tune the LLM. Evaluation and iterative 
refinement can be conducted after the fine-tuning. 

Thanks to the great contributions from various communities, most of these steps have
been implemented in various library packages or are available as API calls, thereby speeding
up the entire pipeline. The traditional pipeline for processing medical text involves curating a
large training dataset and training a specialized model via supervised learning. This pipeline
tends to require significant resources, with the process often spanning several months or even
years. This new paradigm might lower the barriers to building some interesting applications,
with many potentially developed within weeks.

Fig. 5. Performance of VetLLM as a function of note length on CSU test data. The X-axis refers to the
upper quantile of note length for each binned group. Exact match is a baseline evaluation metric,
and F1 is used more frequently in practice.

Beyond strong performance and fast iteration, another advantage of this new paradigm is
the ability to easily expand classification categories. For example, the prompt can be modified
to extract diagnoses of other diseases. In contrast, traditional supervised models might require
extensive fine-tuning to include new classes.


6. Conclusion 


In this study, large language models were used for the task of extracting diagnoses from
veterinary notes. With fine-tuning on only a small number of notes, VetLLM significantly
outperforms strong supervised models. Given the time constraints, simple prompts and resolvers
were used in the study. Richer prompt strategies can be explored, and robustness to prompt
variations should be examined.

In a broader sense, this project has shown the potential of LLMs to work on clinical data
and be efficiently fine-tuned to achieve strong performance on downstream tasks. Although
this study is limited to veterinary notes, we believe the new paradigm detailed in Section 5.3 is
generally applicable. Therefore, it would be interesting to evaluate the performance of base LLMs
and fine-tuned LLMs in other medical applications, including human clinical notes. Furthermore,
it might be interesting to conduct a similar assessment using more advanced models or more
domain-specific ones.

Fig. 6. Performance of VetLLM as a function of note length on PP test data. The X-axis refers to the
upper quantile of note length for each binned group. Exact match is a baseline evaluation metric,
and F1 is used more frequently in practice.
Appendix A. Hyper-parameters used for fine-tuning

Table A1 shows the hyper-parameters used for LoRA fine-tuning.

Table A1. Hyper-parameters used for fine-tuning

Name                   Value
batch_size             32
max_epochs             3
cutoff_len             2048
train_on_inputs        False
lora_r                 8
lora_alpha             16
lora_dropout           0.05
lora_target_modules    ["q_proj", "v_proj"]
Appendix B. Alpaca-7B's responses to disease definition questions


Disorder of auditory system : Disorder of the auditory system is a broad term that refers 
to any disorder of the ear or hearing system that results in a hearing impairment or loss. 
This can include conditions such as conductive hearing loss, sensorineural hearing loss, 
mixed hearing loss, and other disorders that affect the ability to hear. 

Hypersensitivity condition : Hypersensitivity is an abnormal increase in the body’s reactivity
to a substance or stimulus, such as an allergy or an autoimmune disorder. Hypersensitivity
reactions can range from mild to severe and can affect the skin, respiratory tract,
gastrointestinal tract, and other organs. Symptoms may include rash, hives, eczema, congestion,
coughing, wheezing, and other respiratory symptoms, as well as digestive issues.

Disorder of cardiovascular system : Disorder of the cardiovascular system is any abnormality
of the heart, blood vessels, or blood that affects how the body circulates blood.
This can include conditions such as high blood pressure, heart disease, stroke, and blood
clots.

Visual system disorder : Visual system disorder is a condition that affects the way the eyes 
send and process visual information. It can cause problems with vision, such as blurred 
vision, double vision, or difficulty with eye coordination and tracking. It can also lead to 
headaches, eyestrain, and other vision-related problems. 

Infectious disease : Infectious disease is a broad term used to describe any disease that is
caused by the presence, growth, and/or activity of a pathogen, such as a virus, bacteria,
parasite, or fungus. These pathogens can cause illnesses ranging from mild to severe, and
can even be fatal. Infectious diseases are spread through contact with infected individuals,
contact with contaminated surfaces, or contact with contaminated food or water.

Disorder of musculoskeletal system : Disorder of the musculoskeletal system is a general
term that describes any injury or disease that affects the muscles, bones, joints, and associated
structures. Examples of disorders of the musculoskeletal system include fractures,
sprains, strains, tendinitis, bursitis, and muscle pain.

Disorder of integument : Disorder of the integument is a general term used to describe a 
group of skin disorders that affect the structure or function of the skin. These disorders 
can range from mild forms such as dry skin or eczema, to more severe forms such as 
psoriasis or dermatitis. 

Disorder of digestive system : Disorder of the digestive system is any problem with the di- 
gestive tract that impairs its functioning and can cause symptoms such as abdominal pain, 
nausea, vomiting, diarrhea, constipation, and other complications. Common disorders of 
the digestive system include gastroenteritis, irritable bowel syndrome, inflammatory bowel 
disease, and cancer. 

Neoplasm and/or hamartoma : Neoplasm is a general term for a tumor or growth that is 
caused by abnormal cellular proliferation, while hamartoma is a benign tumor composed 
of normal tissue. 
1 Abstract


Recent research has effectively used quantitative traits from imaging to boost the capabilities of 
genome-wide association studies (GWAS), providing further understanding of disease biology and 
various traits. However, it's important to note that phenotyping inherently carries measurement 
error and noise that could influence subsequent genetic analyses. The study focused on left 
ventricular ejection fraction (LVEF), a vital yet potentially inaccurate quantitative measurement, 
to investigate how imprecision in phenotype measurement affects genetic studies. Several methods 
of acquiring LVEF, along with simulating measurement noise, were assessed for their effects on 
ensuing genetic analyses. The results showed that by introducing just 7.9% of measurement noise, 
all genetic associations in an LVEF GWAS with almost forty thousand individuals could be 
eliminated. Moreover, a 1% increase in mean absolute error (MAE) in LVEF had an effect 
equivalent to a 10% reduction in the sample size of the cohort on the power of GWAS. Therefore, 
enhancing the accuracy of phenotyping is crucial to maximize the effectiveness of genome-wide
association studies.


Keywords: Precision phenotyping; Genome-Wide association study; Left ventricular ejection 
fraction; Cardiac magnetic resonance imaging; UK Biobank 
2 Introduction


Cardiovascular disease is the leading cause of death in the world, and significant work has been 
undertaken to understand the mechanisms of disease and develop preventive measures. By 
studying the human genome, insights have been obtained to understand pathways and 
mechanisms of function and disease risk, and in recent studies, researchers have moved beyond 
binary labels of disease diagnosis to quantitative phenotypes to obtain greater power in assessing 
the relationship between genotype and phenotype! ~. From quantitative laboratory biomarkers 
elucidating the relationship between hypercholesterolemia and coronary artery disease’ to 
imaging characteristics in population cohorts* revealing the genetic determinants of 
cardiovascular development °’, quantitative assessments of health provide additional signal 
compared to conventional binary labels of disease. 

Despite its relative frequency, critical public health importance, and often penetrant 
inheritance, heart failure has relatively few known genetic risk factors. Early classic genetic 
studies were not able to identify many genetic associations with measurements determined by 
echocardiography’. Recent studies with larger cohorts and measurements from cardiac MRI have 
been able to find additional loci of relevance and reaffirm previously suspected variants’, 
suggesting both larger sample sizes, as well as improvements in phenotyping precision, can 
improve our understanding of the human disease. 

While quantitative traits often have more power than binary labels of disease, the issue of 
measurement error in quantitative traits is a known problem’. For example, left ventricular 
ejection fraction (LVEF) as measured by echocardiography can have measurement variation up 
to 7 - 10%!°!!, impacting downstream analyses. We use LVEF, the most prevalent metric of 
cardiac function, as an example of an important but noisy measurement to explore the impact of 
measurement variability on downstream genetic association studies. We compare various 
methods to obtain the same phenotypic measurement as well as introduce simulated noise in the 
phenotype measurement to evaluate the relative impact of measurement noise and sample size on 
downstream genetic studies. 


Table 1. Cohort baseline characteristics

Characteristic                    Mean or n
N                                 39624
Age at MRI                        54.9 ± 7.47
Male                              18933 (47.8%)
Self-identified White British     33726 (85.1%)
Body mass index (kg/m²)           26.5 ± 4.19
Hypertension                      2487 (6.3%)
Pulse rate                        67.9 ± 10.9
LV ejection fraction (%)          55.4 (6.78)
LV end diastolic volume (mL)      141
LV end systolic volume (mL)       64.1
3 Methods


3.1 Cohort 


The UK Biobank is a population-based cohort that links genetic and phenotypic data for
approximately 500,000 adult participants from the United Kingdom. We focused on 39,624
participants who had InlineVF-measured LVEF, cardiac MRI, and genetic data available.
Before running genome-wide association studies, this cohort was passed through additional
quality-control filters (Figure A1).


3.2 Multiple Approaches to Measure LVEF 


Multiple methods of calculating LVEF from the same underlying imaging data were used to
assess the impact of phenotyping precision on downstream analyses. First, the UKB provides
automated LVEF measurements derived from MRI using InlineVF software; however, these are
presented without manual quality control. To compare alternative automated approaches, we also
derived LVEF from MRIs using the deep learning segmentation approach suggested by Bai et al.
After segmentation was performed on the short-axis view videos, we calculated the LV volume
for each frame with Simpson’s method and used the following LVEF formula:

\[ \mathrm{LVEF} = \frac{\text{ED Volume} - \text{ES Volume}}{\text{ED Volume}} \times 100 \tag{1} \]
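
Equation (1) can be illustrated with a short sketch. It assumes that the end-diastolic (ED) and end-systolic (ES) volumes are taken as the largest and smallest per-frame LV volumes, which is a common convention but is not stated in the text. The function name is hypothetical.

```python
def lvef_percent(frame_volumes_ml):
    """LVEF (%) from per-frame LV volumes, following Equation (1).

    Assumes ED volume = largest frame volume and ES volume = smallest frame volume.
    """
    ed_volume = max(frame_volumes_ml)
    es_volume = min(frame_volumes_ml)
    if ed_volume <= 0:
        raise ValueError("end-diastolic volume must be positive")
    return (ed_volume - es_volume) / ed_volume * 100.0
```

For instance, an ED volume of 141 mL and an ES volume of 64.1 mL give an LVEF of about 54.5%.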

To simulate reader variability, additional experiments were performed introducing Gaussian
noise with a mean of 0 and a standard deviation (sd) ranging from 1 to 10. We generated multiple
phenotypic measurements from the same underlying imaging data, gradually incrementing 
Gaussian noise, and performed GWAS on each to investigate how measurement 
error/imprecision affects genetic associations. 
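
The noise simulation can be sketched as follows. For zero-mean Gaussian noise with standard deviation sd, the expected mean absolute error is sd × sqrt(2/π), about 0.8 × sd. The seed and the function names are assumptions, and whether the noisy values were clipped to a valid LVEF range is not stated in the text.

```python
import math
import random


def add_gaussian_noise(lvef_values, sd, seed=0):
    """Return LVEF values perturbed by zero-mean Gaussian noise with the given sd."""
    rng = random.Random(seed)
    return [value + rng.gauss(0.0, sd) for value in lvef_values]


def mean_absolute_error(a, b):
    """Mean absolute difference between two equally long sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def expected_mae(sd):
    """Expected |noise| for zero-mean Gaussian noise: sd * sqrt(2 / pi)."""
    return sd * math.sqrt(2.0 / math.pi)
```

Under this relation, sd = 5 corresponds to an MAE of about 3.99 and sd = 10 to about 7.98, close to the 3.97% and 7.92% reported in the results.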

Additionally, we further compared results with two final approaches to assess LVEF. When 
visually assessing LVEF, clinicians often round the value to the nearest 5%, thus we generated a 
set of phenotype labels by rounding LVEF values to the nearest multiple of 5. For the final 
comparison, we generated binary LVEF labels by categorizing values as normal or abnormal, 
with normal values ranging from 52-72 for males and 54-74 for females. 
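
The two coarsened phenotypes can be sketched as below. The sketch assumes that ties are rounded up and that the normal ranges are inclusive at both ends. Neither detail is specified in the text, and the names are hypothetical.

```python
import math

# Sex-specific normal LVEF ranges (%) from the text; inclusive bounds are an assumption.
NORMAL_RANGE = {"male": (52.0, 72.0), "female": (54.0, 74.0)}


def round_to_nearest_5(lvef: float) -> int:
    """Round LVEF to the nearest multiple of 5 (ties rounded up)."""
    return 5 * math.floor(lvef / 5 + 0.5)


def is_normal(lvef: float, sex: str) -> bool:
    """Binary label: True if LVEF lies within the sex-specific normal range."""
    low, high = NORMAL_RANGE[sex]
    return low <= lvef <= high
```

`math.floor(x + 0.5)` is used instead of the built-in `round`, which rounds ties to the nearest even number and would map 52.5 to 50.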


3.3 Genome-wide association study 


We used the UKB imputed genotype calls in BGEN v1.2 format. Samples were genotyped using 
the UK BiLEVE or UK Biobank Axiom arrays. Imputation was performed using the Haplotype 
Reference Consortium panel and the UK10K+1000 Genomes panel”. We used the QC files 
provided by UKB to create a GWAS cohort consisting of subjects who did not withdraw, were of 
inferred European ancestry, and were unrelated. Subjects with a genotype call rate < 0.98 were 
also removed. We considered variants with a minor allele frequency (MAF) > 0.01, and we 
required genotyped variants to have a call rate > 0.95 and imputed variants to have an INFO 
score > 0.3. Variants with a Hardy-Weinberg equilibrium P value < 1x10- were excluded. After 
variant filtering, we were left with 9,774,199 variants. GWAS was performed on a Spark 3.1.1
cluster, using the library Hail 0.2 with Python version 3.6. The GWAS was adjusted for age at
MRI and sex. We used the conventional P value of 5 × 10⁻⁸ as the threshold for defining
genome-wide significance.


3.4 Assessing Association Power’s Relationship with Cohort Size 


Apart from noise in phenotype measurements, we also evaluate the effect of cohort decrease on 
GWAS results. We generated 6 different phenotype files where, starting from the original LVEF 
cohort (39,624), we keep 90% (35,661), 80% (31,699), 70% (27,736), 60% (23,774), 50%
(19,812), and 40% (15,850) of the samples. Cohort decrease was performed before GWAS QC,
and for each step the selection of samples to be excluded was random. Inspecting the effect of 
cohort decrease helps us define the relationship between the number of LVEF samples and 
GWAS power. 


3.5  SNP-based accuracy 


We use an accuracy metric to determine the amount of overlap in significant SNPs between the
baseline GWAS results and noise-modified GWAS results. First, we remove all non-significant
SNPs, i.e., SNPs with a p-value greater than 5 × 10⁻⁸, which is the Bonferroni-corrected
p-value threshold. Then, we consider significant SNPs found in both the base results and
noise-modified results as true positives (TP), the SNPs found only in the noise-modified results as
false positives (FP), and the SNPs not found in the noise-modified results but found in the base
results as false negatives (FN). We then calculate

\[ \mathrm{SNP}_{\mathrm{accuracy}} = \frac{TP}{TP + FP + FN} \tag{2} \]
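
Equation (2) is the overlap (Jaccard index) between the two sets of significant SNPs. A minimal sketch, assuming each GWAS result is available as a mapping from SNP identifier to p-value (a hypothetical input format):

```python
def snp_accuracy(base_pvalues, noisy_pvalues, threshold=5e-8):
    """SNP-based accuracy, TP / (TP + FP + FN), between two GWAS results."""
    base_sig = {snp for snp, p in base_pvalues.items() if p < threshold}
    noisy_sig = {snp for snp, p in noisy_pvalues.items() if p < threshold}
    tp = len(base_sig & noisy_sig)   # significant in both
    fp = len(noisy_sig - base_sig)   # significant only in the noise-modified GWAS
    fn = len(base_sig - noisy_sig)   # significant only in the baseline GWAS
    total = tp + fp + fn
    return tp / total if total else 0.0  # convention: 0 when neither has hits
```

The value is 1.0 when both analyses flag exactly the same SNPs and 0 when the noise-modified GWAS retains none of the baseline hits.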


3.6 GWAS Sensitivity 


Sensitivity determines the amount of overlap in significant loci between the baseline GWAS
results and noise-modified GWAS results. Specifically, given that peaks_base is the number of
significant loci in the base GWAS, and peaks_correct is the number of significant loci that
persisted in the noise GWAS, then

\[ \mathrm{Sensitivity} = \frac{\mathit{peaks}_{\mathrm{correct}}}{\mathit{peaks}_{\mathrm{base}}} \tag{3} \]

The number of loci and their positions can be determined by manual inspection, but we also
developed an automatic method. Our automatic method applies a hierarchical clustering
algorithm to SNPs above the significance threshold line to determine the number and the
position of loci from both GWAS, which we then use to compute peaks_base and peaks_correct.
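
One way to realize the automatic method is sketched below. It assumes single-linkage clustering of significant SNPs by base-pair distance within each chromosome, which in one dimension reduces to splitting sorted positions wherever the gap exceeds a cutoff. The 500 kb cutoff, the rule for deciding that a locus "persisted" and all names are hypothetical. The authors' actual clustering settings are not given.

```python
def cluster_loci(sig_snps, max_gap=500_000):
    """Group significant SNPs, given as (chromosome, position), into loci.

    Returns a list of (chromosome, start, end) tuples.
    """
    loci = []
    for chrom, pos in sorted(sig_snps):
        if loci and loci[-1][0] == chrom and pos - loci[-1][2] <= max_gap:
            loci[-1] = (chrom, loci[-1][1], pos)  # extend the current locus
        else:
            loci.append((chrom, pos, pos))        # start a new locus
    return loci


def loci_sensitivity(base_snps, noisy_snps, max_gap=500_000):
    """Equation (3): fraction of baseline loci that persist in the noisy GWAS."""
    base_loci = cluster_loci(base_snps, max_gap)
    if not base_loci:
        return 0.0
    noisy = list(noisy_snps)
    persisted = sum(
        1
        for chrom, start, end in base_loci
        if any(c == chrom and start - max_gap <= p <= end + max_gap for c, p in noisy)
    )
    return persisted / len(base_loci)
```

With five baseline loci, this definition can only take the values 0, 0.2, 0.4, 0.6, 0.8 and 1.0, which matches the granularity of the loci sensitivity column in Table 2.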


3.7 Heritability 


Heritability is a measure of the level of influence genetic variation has on a given trait’s
phenotypic variation. To estimate SNP heritability from GWAS summary statistics, we use the
command-line tool LDSC. LDSC performs LD score regression between the GWAS test statistic χ²
and the per-SNP LD scores, which allows for the estimation of h².
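
For reference, the regression that LDSC fits can be written in its standard form as below. The symbols are not defined in this paper and are introduced here for illustration: N is the GWAS sample size, M the number of SNPs, ℓ_j the LD score of SNP j, and a a term capturing confounding such as population stratification.

```latex
\[
  \mathbb{E}\left[\chi^2_j \mid \ell_j\right] = \frac{N h^2}{M}\,\ell_j + N a + 1
\]
```

The slope of χ² on the LD score is therefore proportional to h², which is how heritability is recovered from summary statistics alone.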
Figure 1 Manhattan plots for genome-wide association studies on UK Biobank reported left ventricular
ejection fraction. a, GWAS on continuous LVEF measurements; b, GWAS on normal/abnormal LVEF,
where the normal range is 52-72 in the male and 54-74 in the female population; c, GWAS on LVEF
bucketed to the nearest multiple of 5


4.1 Quantitative phenotypes improve power of association studies 


The study cohort for all analyses consisted of 39,624 unrelated adult subjects of European
ancestry (Table 1). As a baseline, we first conducted a GWAS of the LVEF phenotype released
with the UKBB cardiac MRI data. We identified 5 loci at genome-wide significance on
chromosomes 1, 6, 8, 10, and 19 near genes ZBTB17, CDKN1A, CTSB, BAG3, and AP1M1
(Figure 1). In comparison, for an LVEF phenotype binarized to simply abnormal or normal,
multiple previously detected loci lost genome-wide significance (including loci for CTSB and
AP1M1). Similarly, recognizing the inherent variation present in measuring LVEF, we
additionally compared the results when the LVEF was bucketed into 5% bins and showed that such
imprecision decreased statistical power for all SNPs in the association study compared to the
continuous LVEF baseline phenotype.
Figure 2 Impact of noise in LVEF on GWAS. a, R² score, mean absolute error, and the
distribution of noise-modified LVEF with respect to the baseline LVEF; b, Q-Q plots of P values from
GWAS summary statistics for different levels of noise


4.2 Phenotype noise degrades power of association studies 


To investigate the effect of measurement imprecision on GWAS power, we performed a series of
association studies while introducing noise in the range of known clinician variation (Figure 2).
Adding simulated variation to the LVEF measurement naturally increases its mean absolute error.
Noise with a Gaussian standard deviation of 5 results in a mean absolute error of 3.97% and an R²
of 0.65 (Table A1), and results in the loss of genome-wide significance for the AP1M1 locus on
chromosome 19. As we increase phenotypic noise in the range of clinical variation, heritability
and power gradually decline, and noise equivalent to 7.92% MAE results in a complete loss
of genome-wide significance (Table 2). Given that echocardiography is known to have a
clinician-to-clinician variation of the same or greater MAE, such measurement imprecision could
contribute to the limited hits in historical echocardiography-derived GWAS.


Table 2. Metrics of genetic signal for each increase in SD

Noise SD    SNP Accuracy    Loci Sensitivity    Heritability
0%          1.0             1.0                 0.1114 (0.0357)
1%          0.9377          1.0                 0.1055 (0.0332)
2%          0.8547          1.0                 0.0878 (0.0352)
3%          0.3675          1.0                 0.1003 (0.0265)
4%          0.2537          0.8                 0.089 (0.0256)
5%          0.3921          0.8                 0.1208 (0.0355)
6%          0.0228          0.4                 0.0179 (0.0271)
7%          0.0307          0.4                 0.0482 (0.0247)
8%          0.0145          0.4                 0.022 (0.0349)
9%          0.0020          0.2                 0.0355 (0.0204)
10%         0               0                   0.0477 (0.0214)
4.3 Comparison of Impact of Phenotype Noise vs Cohort Size


Given the summary statistics from 16 different GWAS, we modeled the relationship between 
noise and GWAS power (Figure 3, Figure A4). There is a linear relationship between the 
increase in MAE and the decrease in GWAS power. We calculated that an increase of 1% in 
MAE causes the loci sensitivity to decrease by 13% (p=5.5e-6) and the SNP accuracy by 14% 
(p=6.6e-5). Experiments with other methods of introducing noise in assessing LVEF similarly 
show a decrease in genetic association with more imprecise measurements (Figure A3, Table 
A2). A similar effect occurs with reductions in cohort size, as a 1% decrease in cohort size 
results in a 1.3% decrease in loci sensitivity (p=0.01) and a 1.9% decrease in SNP-based 
accuracy (p=0.0007). We found that a 1% MAE increase has the same effect on loci sensitivity 
as a 10.3% cohort decrease and the same effect as a 7.2% cohort decrease on SNP accuracy. 
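
The equivalence between noise and cohort size follows from the ratio of the two fitted slopes. The sketch below redoes that arithmetic with the rounded slopes quoted above. The function name is hypothetical, and the small differences from the reported 10.3% and 7.2% presumably arise because the reported figures were computed from unrounded slopes.

```python
def equivalent_cohort_decrease(drop_per_mae_point: float, drop_per_cohort_point: float) -> float:
    """Cohort decrease (%) with the same effect on a metric as a 1% increase in MAE."""
    return drop_per_mae_point / drop_per_cohort_point


# Rounded slopes from the text: % drop in the metric per 1% MAE and per 1% cohort decrease.
loci_equivalent = equivalent_cohort_decrease(13.0, 1.3)  # loci sensitivity
snp_equivalent = equivalent_cohort_decrease(14.0, 1.9)   # SNP accuracy
```

With the rounded inputs, a 1% increase in MAE corresponds to a cohort decrease of about 10% for loci sensitivity and about 7.4% for SNP accuracy.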


[Figure 3: two slope charts; panel a: "Change in p value of top 5 loci with respect to noise"; panel b: "Change in p value of top 5 loci with respect to cohort decrease"]

Figure 3 a, Slope chart showing the change in the P value of the top 5 loci with respect to mean
absolute error; b, slope chart showing the change in the P value of the top 5 loci with respect to
the cohort decrease. Each locus is named after the closest gene


4.4 Improving phenotyping augments downstream genetic analyses 


Cardiac MRI provides clinicians and researchers with a plethora of high-resolution imaging, with 
even the abbreviated 20-min UK Biobank cardiac MRI protocol resulting in 9 sequences with 
over 30,000 images per study'’. With so many images and patients, the released UKBB 
measurements were generated using a fully automated workflow (with Siemens inLineVF) 
without quality inspection and bias correction. When compared with manual clinician evaluation, 
the automated measurements of LVEF result in a mean absolute error (MAE) of 3.4%, R² of 
0.348, and ICC of 0.521 for LVEF!°. Imprecision in the inline LVEF can be partially addressed 
by linear adjustment'* and doing so slightly increases genetic signal, within the difference in 
identified loci with MAE of 1% (Figure A2). To evaluate the role of imprecision, we applied a 
deep learning-based method of obtaining LVEF and analyzed downstream results. Using a 
previously published deep learning segmentation model®, we independently derived LV 
segmentation-based calculated LVEF and found an MAE of 6.1%, R² of 0.335, and ICC of 0.431 
for LVEF compared to the automated measurements from UKBB, and an MAE of 5.3%, R² of 
0.60, and ICC of 0.518 compared to the linearly adjusted LVEF (Figure 5). However, with these 
deep learning segmentation-derived LVEF measurements, GWAS in the same cohort identified 
more loci of interest whose relevance is backed by significant experimental data. In particular, 
loci on chromosomes 2, 5, and 8 near the genes TTN, DNAJC18, and ZNF572 were not 
previously identified using the released UKBB LVEF measurements but were detected with our 
quality-controlled measurements. While we could not directly compare the segmentation-derived 
LVEF measurements to clinical labels because manual labels were unavailable, the stronger 
genetic signal and higher association with linearly adjusted LVEF suggest that deep 
learning-derived LVEF is less noisy. 

Figure 5. Differences in distribution and GWAS summary statistics between two methods of obtaining 
LVEF from MRI. a, Histograms of InlineVF-derived LVEF and deep learning-derived LVEF. b, 
Manhattan plot from GWAS performed on deep learning-derived LVEF; genes colored in blue do not 
appear in the InlineVF LVEF GWAS (Figure 1a); genes colored in red appear in the InlineVF LVEF 
GWAS but not in the deep learning-derived LVEF GWAS. 


5 Discussion 


In this study, we assessed the impact of measurement noise on genetic associations with LVEF 
and found that even slight increases in measurement imprecision substantially impair the power 
of downstream GWAS analysis, often to a greater extent than changes in cohort size. 
As measurement variation is present in many clinical measurements, efforts to improve the 
precision of measurements can potentially be a cost-effective way to maximize the yield of 
genetic association studies. 

Cardiac function as measured by LVEF is an important clinical measurement that defines 
disease and identifies patients who are eligible for life-prolonging therapeutics such as 
implantable devices. In echocardiography, human test-retest variation in LVEF can range from 
7% to 10%, with slight changes in annotation as well as timing that can significantly impact 
calculations!®!”. Few variability studies have been undertaken in cardiac MRI, although similar 
degrees of manual measurement variability have been found”. Prior studies have suggested that 
polygenic risk scores of LVESV have more power than polygenic risk scores of LVEF’, 
consistent with our analyses showing that more precise measurements correspond to stronger 
genetic associations. Our analysis suggests that a substantial, primary gain in signal comes from 
improving noisy measurements, which otherwise reduce the power and accuracy of downstream 
analyses. 

Noise in measurements can appear in both semi-automated and fully automated 
workflows!', and by improving the precision of measuring LVEF, we also improve the accuracy 
and robustness of downstream GWAS results. The improvement in yield of 
genetic association with more precise phenotyping was substantial in comparison to the marginal 
benefit of increasing the cohort size. As more genetic analyses are undertaken with automated 
measurements or assessments*’??!7, additional evaluation should be undertaken to assess the 
variability and quality of the phenotyping. Such insights ideally will be confirmed with 
orthogonal measurements of similar phenotypes. Some of the first genetic association studies 
were performed on quantitative traits like height, but it should be recognized that many imaging- 
based phenotypes do not have the same precision and accuracy as the assessment of height on a 
population. 

In summary, genetic association studies on imaging phenotypes allow researchers to 
discover many associations that help understand the underlying biology of the disease and 
structure”. For LVEF, even advanced imaging has variability in measurements that can 
substantially impact downstream association studies. The impact of such variability can be even 
greater than that of substantial changes in cohort size, suggesting that improving imaging 
precision, and phenotyping precision in general, adds significant value to the power of 
genetic association studies. 

Our study offers key insights into the effect of measurement noise on genetic associations with 
LVEF. However, a few limitations remain. First, the impact of measurement noise could vary for 
different quantitative phenotypes, and thus future studies should investigate its influence on 
various phenotypes for a broader understanding. Second, our GWAS methodology could be 
further enhanced by using a linear mixed model method”*, which has been shown to produce 
more significant associations. Lastly, while our deep learning LVEF method showed a high 
GWAS signal, we could not compare it to manual clinical labels due to their unavailability. 


6 Appendix 

Table A1. Mapping between Gaussian Noise SD and MAE 

SD    MAE        R²
0     0          1
1     0.797489   0.9788
2     1.594416   0.9199
3     2.386753   0.8371
4     3.183924   0.743
5     3.974958   0.6508
6     4.793956   0.5632
7     5.604129   0.4832
8     6.380848   0.4192
9     7.228321   0.3602
10    7.920860   0.3183
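
The MAE column grows almost exactly in proportion to the noise SD, at about 0.8 per unit of SD. This is what one would expect if the added noise is zero-mean Gaussian and the MAE is taken between the noisy and the original values; that reading is an assumption here, since the calculation is not spelled out in this section. For noise ε drawn from N(0, σ²), the expected absolute error is:

```latex
\mathbb{E}\,|\varepsilon|
  = \int_{-\infty}^{\infty} |x|\,\frac{1}{\sigma\sqrt{2\pi}}\,e^{-x^{2}/(2\sigma^{2})}\,dx
  = \frac{2}{\sigma\sqrt{2\pi}} \int_{0}^{\infty} x\,e^{-x^{2}/(2\sigma^{2})}\,dx
  = \frac{2\sigma^{2}}{\sigma\sqrt{2\pi}}
  = \sigma\sqrt{\frac{2}{\pi}} \approx 0.798\,\sigma
```

For σ = 1 this gives about 0.798, close to the tabulated 0.797489. Small departures at larger SD would be consistent with sampling variation or with bounding of the noisy values, neither of which can be confirmed from the table alone.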


Table A2. Metrics of genetic signal for each decrease in cohort size 

Cohort decrease   SNP Accuracy   GWAS Sensitivity   Heritability
0%                1.0            1.0                0.1114 (0.0357)
10%               0.8744         0.8                0.1071 (0.0397)
20%               0.8713         0.8                0.0867 (0.037)
30%               0.3436         1.0                0.082 (0.0332)
40%               0.1392         0.4                0.0497 (0.0216)
50%               0.0477         0.4                0.039 (0.0287)
60%               0.0019         0.2                0.0384 (0.0288)


39,624 Main study cohort 
16,369 excluded for GWAS QC, relatedness, and ancestry 
22,859 GWAS cohort 

Figure A1. Cohort diagram 


Figure A2. Manhattan plot for genome-wide association study on corrected left-ventricular ejection 
fraction. Y-axis: P value (-log10 scale). 

Figure A3. Q-Q plots of P values from GWAS summary statistics for different percentages of cohort 

Figure A4. Impact of cohort decrease and noise generation on GWAS power. a, Regression analysis of 
the impact of measurement error, quantified by mean absolute error, on sensitivity. b, Regression 
analysis of the impact of the mean absolute error on SNP accuracy. c, Regression analysis of the impact 
of cohort size decline on sensitivity. d, Regression analysis of the impact of cohort size decline on SNP 
accuracy. 




Pacific Symposium on Biocomputing 2024 


References 

1. Pirruccello, J. P. et al. Genetic analysis of right heart structure and function in 40,000 
   people. bioRxiv (2021) doi:10.1101/2021.02.05.429046. 
2. Pirruccello, J. P. et al. Analysis of cardiac magnetic resonance imaging in 36,000 
   individuals yields genetic insights into dilated cardiomyopathy. Nat. Commun. 11, 2254 
   (2020). 
3. Agrawal, S. et al. Inherited basis of visceral, abdominal subcutaneous and gluteofemoral fat 
   depots. Nat. Commun. 13, 3771 (2022). 
4. Haas, M. E. et al. Machine learning enables new insights into genetic contributions to liver 
   fat accumulation. Cell Genom 1, (2021). 
5. Liu, D. J. et al. Exome-wide association study of plasma lipids in >300,000 individuals. Nat. 
   Genet. 49, 1758-1766 (2017). 
6. Bai, W. et al. Automated cardiovascular magnetic resonance image analysis with fully 
   convolutional networks. J. Cardiovasc. Magn. Reson. 20, 65 (2018). 
7. Meyer, H. V. et al. Genetic and functional insights into the fractal structure of the heart. 
   Nature 584, 589-594 (2020). 
8. Vasan, R. S. et al. Genetic variants associated with cardiac structure and function: a 
   meta-analysis and replication of genome-wide association data. JAMA 302, 168-178 (2009). 
9. Carroll, R. J. et al. Nonparametric Prediction in Measurement Error Models [with 
   Comments]. J. Am. Stat. Assoc. 104, 993-1014 (2009). 
10. Farsalinos, K. E. et al. Head-to-Head Comparison of Global Longitudinal Strain 
    Measurements among Nine Different Vendors: The EACVI/ASE Inter-Vendor Comparison 
    Study. J. Am. Soc. Echocardiogr. 28, 1171-1181, e2 (2015). 
11. O’Dell, W. G. Accuracy of Left Ventricular Cavity Volume and Ejection Fraction for 
    Conventional Estimation Methods and 3D Surface Fitting. J. Am. Heart Assoc. 8, e009124 
    (2019). 
12. Littlejohns, T. J. et al. The UK Biobank imaging enhancement of 100,000 participants: 
    rationale, data collection, management and future directions. Nat. Commun. 11, 2624 
    (2020). 
13. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. 
    Nature 562, 203-209 (2018). 
14. Petersen, S. E. et al. UK Biobank’s cardiovascular magnetic resonance protocol. J. 
    Cardiovasc. Magn. Reson. 18, 8 (2016). 
15. Suinesiaputra, A. et al. Fully-automated left ventricular mass and volume MRI analysis in 
    the UK Biobank population cohort: evaluation of initial results. Int. J. Cardiovasc. Imaging 
    34, 281-291 (2018). 
16. Bulik-Sullivan, B. et al. An atlas of genetic correlations across human diseases and traits. 
    Nat. Genet. 47, 1236-1241 (2015). 
17. Petersen, S. E. et al. UK Biobank’s cardiovascular magnetic resonance protocol. J. 
    Cardiovasc. Magn. Reson. 18, 1-7 (2016). 
18. Sanghvi, M. M. et al. Automatic left ventricular analysis with Inline VF performs well 
    compared to manual analysis: results from Barts Cardiovascular Registry. J. Cardiovasc. 
    Magn. Reson. 18, 1-2 (2016). 
19. Yuan, N. et al. Systematic Quantification of Sources of Variation in Ejection Fraction 
    Calculation Using Deep Learning. JACC Cardiovasc. Imaging 14, 2260-2262 (2021). 
20. Augusto, J. B. et al. Diagnosis and risk stratification in hypertrophic cardiomyopathy using 
    machine learning wall thickness measurement: a comparison with human test-retest 
    performance. Lancet Digit Health 3, e20-e28 (2021). 
21. Zekavat, S. M. et al. Deep Learning of the Retina Enables Phenome- and Genome-Wide 
    Analyses of the Microvasculature. Circulation 145, 134-150 (2022). 
22. Kosaraju, A., Goyal, A., Grigorova, Y. & Makaryus, A. N. Left Ventricular Ejection 
    Fraction. in StatPearls (StatPearls Publishing, 2022). 
23. Uffelmann, E. et al. Genome-wide association studies. Nature Reviews Methods Primers 1, 
    1-21 (2021). 
24. Loh, P.-R., Kichaev, G., Gazal, S., Schoech, A. P. & Price, A. L. Mixed-model association 
    for biobank-scale datasets. Nat. Genet. 50, 906-908 (2018). 




A deep neural network estimation of brain age is sensitive to cognitive impairment and 
decline 

The greatest known risk factor for Alzheimer’s disease (AD) is age. While both normal aging and 
AD pathology involve structural changes in the brain, their trajectories of atrophy are not the same. 
Recent developments in artificial intelligence have encouraged studies to leverage 
neuroimaging-derived measures and deep learning approaches to predict brain age, which has shown 
promise as a sensitive biomarker in diagnosing and monitoring AD. However, prior efforts primarily 
involved structural magnetic resonance imaging and conventional diffusion MRI (dMRI) metrics without 
accounting for partial volume effects. To address this issue, we post-processed our dMRI scans with 
an advanced free-water (FW) correction technique to compute distinct FW-corrected fractional 
anisotropy (FA_FWcorr) and FW maps that allow for the separation of tissue from fluid in a scan. We 
built three densely connected neural networks from FW-corrected dMRI, T1-weighted MRI, and 
combined FW+T1 features, respectively, to predict brain age. We then investigated the relationship 
of actual age and predicted brain ages with cognition. We found that all models accurately predicted 
actual age in cognitively unimpaired (CU) controls (FW: r=0.66, p=1.62x10°?; T1: r=0.61, 
p=1.45x10?°, FW+T1: r=0.77, p=6.48x10°°) and distinguished between CU and mild cognitive 
impairment participants (FW: p=0.006; T1: p=0.048; FW+T1: p=0.003), with FW+T1-derived age 
showing best performance. Additionally, all predicted brain age models were significantly associated 
with cross-sectional cognition (memory, FW: B=-1.094, p=6.32x10°7; T1: B=-1.331, p=6.52x107; 
FW+T1: B=-1.476, p=2.53x10"°; executive function, FW: B=-1.276, p=1.46x10°; T1: B=-1.337, 
p=2.52x107; FW+T1: B=-1.850, p=3.85x10°!”) and longitudinal cognition (memory, FW: B=-0.091, 
p=4.62x10"'; T1: B=-0.097, p=1.40x10°; FW+T1: B=-0.101, p=1.35x10°!'; executive function, FW: 
B=-0.125, p=1.20x10°'°; T1: B=-0.163, p=4.25x 10°"; FW+T1: B=-0.158, p=1.65x 10°"). Our findings 
provide evidence that both T1-weighted MRI and dMRI measures improve brain age prediction and 
support predicted brain age as a sensitive biomarker of cognition and cognitive decline. 


Keywords: Alzheimer’s disease, free-water correction, deep neural network, cognition 


© 2023 The Authors. Open Access chapter published by World Scientific Publishing Company and distributed under 
the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC) 4.0 License. 


1. Introduction 


Alzheimer’s disease (AD) is a progressive neurodegenerative disorder whose greatest known risk 
factor is advancing age. Both normal aging and AD are accompanied by structural changes in the 
brain, but they follow distinct trajectories. Specifically, healthy aging typically exhibits global 
reductions in gray matter volume!” characterized by volume loss in frontal and temporal lobes** 
and enlargement of ventricles’, whereas AD-related brain atrophy typically starts in the 
hippocampus and gradually spreads to the entire brain”. Additionally, studies have shown that AD 
brains undergo deterioration more rapidly than healthy brains®. Given these differences, recent 
efforts have used neuroimaging-derived measures of gray matter volume from T1-weighted 
magnetic resonance imaging (MRI) and white matter microstructure from diffusion MRI (dMRI) to 
predict an individual’s “brain age” via machine learning approaches? '”; this brain age can differ from 
their chronological age and predict cognitive decline’*’. These models were trained on cognitively 
unimpaired individuals to learn common patterns in healthy aging, which then allowed them to 
detect aging-related abnormalities such as those associated with AD. A larger difference between 
brain age and chronological age indicates that the individual is on an accelerated trajectory 
compared with normal aging and is typically seen in individuals with cognitive impairment (e.g., 
mild cognitive impairment [MCI], AD)'*'!8, suggesting the potential of brain age as a sensitive 
biomarker along the AD continuum. Moreover, the development of the free-water (FW) correction 
post-processing technique!’ has enabled the partition of a conventional fractional anisotropy (FA) 
map into a FW-corrected FA map (FA_FWcorr) and a FW map; the FA_FWcorr and FW metrics 
describe tissue and fluid, respectively, thereby enhancing the biological specificity of dMRI scans. 
Recently, our group has demonstrated that abnormal FW-corrected dMRI metrics are associated 
with higher rates of longitudinal cognitive decline and diagnosis along the AD clinical 
continuum”””!, These findings suggest that incorporating FW-corrected metrics into models of 
predicted brain age may provide more sensitive associations with cognitive impairment and decline. 

The present study leveraged neuroimaging data from a longitudinal cohort of aging to build three 
densely connected neural networks using FW-corrected dMRI, T1-weighted MRI, and combined 
FW+T1 features to predict participant brain age. To evaluate model performance, we examined the 
relationship between predicted brain age and chronological age. We then investigated the 
association between predicted brain age and two domains of cognition (memory and executive 
function performance at baseline and over time). We hypothesized that FW-, T1-, and FW+T1- 
derived models would all accurately predict participant brain age, with the FW+T1-derived model 
showing the best performance as it incorporates both gray and white matter regions. We also 
hypothesized that all predicted brain age models would predict baseline and longitudinal memory 
and executive function performance, with FW+T1-derived brain age showing the strongest 
associations. 


2. Methods 


2.1. Participants 


All data leveraged in the present study were obtained from the Vanderbilt Memory and Aging 
Project (VMAP)*”, a longitudinal observational study that was launched in 2012 and recruited 
individuals 60 years and older who speak English, have adequate auditory and visual capacity for 
testing, and have a stable study partner. Participants underwent comprehensive neuropsychological 
assessment and were categorized into cognitively unimpaired (CU) or MCI status; MCI participants 
were age-, sex-, and race-matched with CU controls. Cognitive (memory, executive function) 
measures were obtained from all participants and neuroimaging (T1-weighted MRI, dMRI) 
measures were obtained from a subset of participants. Only participants who had all necessary 
cognitive and neuroimaging data were included in the present study (n=295). All protocols for 
VMAP were approved by the IRB at Vanderbilt University Medical Center and all participants gave 
voluntary informed consent prior to enrollment. Data from the VMAP cohort can be freely accessed 
following approval (vmacdata.org). Table 1 summarizes demographic and clinical information for 
the present cohort. 


Table 1. Vanderbilt Memory and Aging Project Cohort Information 

                                          Diagnosis at Baseline
Measure                                   CU             MCI            p-value
Cohort Characteristics
  Number of participants                  168            127            -
  Total number of visits                  568            372            -
  Longitudinal follow-up (years)          3.10 (1.44)    2.96 (1.43)    0.230
Demographics and Health Characteristics
  Age at baseline (years)                 73.10 (7.16)   73.67 (7.41)   0.504
  Sex (% female)                          42.26          42.52          1.000
  Education (years)                       16.39 (2.48)   15.09 (2.75)   <0.001
  Race (% non-Hispanic White)             86.90          86.61          1.000
  APOE-ε4 (% positive)                    29.76          44.09          0.016

Mean (standard error) are provided unless otherwise indicated. Abbreviations: CU, cognitively unimpaired; 
MCI, mild cognitive impairment; APOE, apolipoprotein E. Boldface signifies p<0.05 unless otherwise indicated. 


2.2. Neuroimaging data acquisition and preprocessing 


T1-weighted MRI images (repetition time: 8.9 ms, echo time: 4.6 ms, resolution: 1 mm isotropic) 
were obtained from each participant on a 3T Philips Achieva scanner using an 8-channel SENSE 
reception coil and underwent multi-atlas segmentation to calculate the volumes of 132 regions of 
interest (ROIs). All measures were normalized by total intracranial volume, calculated as the 
volumetric sum of all 132 segmented ROIs. dMRI images (resolution: 2 mm isotropic, b-values: 0, 
1000 s/mm², number of directions: 32) were obtained from each participant using the same 
scanner and preprocessed using PreQual™. FW and FW-corrected metrics were calculated in 
MATLAB from the preprocessed images, as previously described!?. The FW and FA_FWcorr maps 
were transformed by a non-linear warp using the ANTs package to create a standardized space 
representation. Finally, publicly available tractography templates 
(https://github.com/VUMC-VMAC/Tractography_Templates) were applied to the FW and FA_FWcorr 
maps to quantify white matter microstructure within 48 tracts. 

T1-weighted MRI and FW-corrected dMRI metrics (FA_FWcorr, FW) were harmonized separately 
using Longitudinal ComBat” in R (version 4.1.2), controlling for age at baseline, education, sex, 
race/ethnicity, APOE-ε4 positivity, APOE-ε2 positivity, and the interaction of age at baseline with 
time interval from baseline. We also included the random effects of intercept and time interval from 
baseline for each participant and a batch variable that accounted for all combinations of image 
acquisition. The batch variable was scanner × software × coil for T1 metrics and site × scanner × 
protocol for FW-corrected metrics. 


2.3. Neuropsychological metrics calculation 


Participants completed comprehensive neuropsychological testing administered by experienced 
technicians which assessed multiple cognitive domains, including memory and executive function. 
Psychometrically sound memory and executive function composite scores were calculated from 
item-level data. Longitudinal cognitive measures (memory slope, executive function slope) for each 
participant were obtained by calculating the random effect coefficient using a linear mixed-effects 
model where the fixed effect was time interval from baseline and the outcome was composite score. 
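
One way to write such a model is sketched below. It assumes a random intercept and a random slope on time for each participant; the exact specification is not given in this section, and all symbols are introduced here for illustration only.

```latex
y_{ij} = (\beta_0 + b_{0i}) + (\beta_1 + b_{1i})\,t_{ij} + \varepsilon_{ij},
\qquad
(b_{0i},\, b_{1i})^{\top} \sim \mathcal{N}(\mathbf{0},\, \mathbf{G}),
\qquad
\varepsilon_{ij} \sim \mathcal{N}(0,\, \sigma^{2})
```

Here y_ij is the composite score of participant i at visit j, t_ij is the time interval from baseline, β_0 and β_1 are the fixed intercept and slope, and b_0i and b_1i are the participant-specific deviations. Under this reading, the longitudinal measure for participant i is taken from the random-effect coefficient on time, b_1i.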


2.4. Brain age prediction model architecture 


In the present study, we used a densely connected neural network to predict participants’ brain age 
based on neuroimaging regions (i.e., features) and created three separate models using FW, T1, and 
combined FW+T1 features. Figure 1 shows an overview of model workflow. Each model consists 
of four layers: an input layer whose dimensions correspond to the number of features (FW: 96 
features, T1: 132 features, FW+T1: 228 features), two densely connected layers with rectified linear 
unit (ReLU) activation whose number of nodes equals half and a quarter of the number of features, 
respectively, and an output layer with a single node and linear activation for brain age prediction. 
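
The layer widths implied by this description can be tabulated directly. The sketch below is illustrative rather than the authors' code: it assumes that "half" and "a quarter" are taken by integer division (exact for all three feature counts) and that every dense layer carries a bias term; the helper names are hypothetical.

```python
def layer_sizes(n_features):
    """Node counts for the input layer, two hidden layers, and the output layer."""
    return [n_features, n_features // 2, n_features // 4, 1]


def count_parameters(sizes):
    """Weights plus biases of a fully connected network with these layer sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


# Feature counts stated in the text for each model.
feature_counts = {"FW": 96, "T1": 132, "FW+T1": 228}

for name, n_features in feature_counts.items():
    sizes = layer_sizes(n_features)
    print(f"{name}: layers {sizes}, trainable parameters {count_parameters(sizes)}")
```

Under these assumptions the FW model has hidden layers of 48 and 24 nodes, the T1 model 66 and 33, and the combined model 114 and 57, so even the largest network has only a few tens of thousands of trainable parameters.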

All models were trained on baseline neuroimaging data from the VMAP cohort by subsetting 
all imaging sessions to the first visit of CU participants. We minimized a loss function defined as 
the mean absolute error (MAE) while monitoring the mean squared error (MSE) and 
root mean squared error (RMSE). We conducted ten-fold cross-validation in which 90% of the data 
were used for training and 10% were reserved for testing in each fold, so that every sample was 
tested exactly once across the ten folds. Within the training data for each 
fold, 80% were used to train the model and 20% were used to validate model performance. During 
each fold, training was stopped when the loss function on the validation dataset had not improved 
for 15 epochs and only the best model was saved. For each set of features (FW, T1, FW+T1), saved 
models were compared across folds and the one which yielded the lowest validation loss was 
selected as the final model. All models were developed in Python (version 3.9.13) using the Keras 
library (version 2.9.0) with a TensorFlow backend (version 2.9.1). We used the three final models to 
generate FW, T1, and FW+T1 predicted brain ages for all participants (CU, MCI) at all timepoints 
(baseline, longitudinal follow-ups). 

Figure 1. Model workflow for brain age prediction. We created three separate, densely connected neural 
networks to predict brain age, including FW-derived (A), T1-derived (B), and FW+T1-derived (C) models. 
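
The data partitioning and stopping rule used during training can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the shuffling, the seed, the rounding of the 20% validation share, and the function names are all hypothetical, and the stopping rule is written to mimic a patience of 15 epochs on the validation loss.

```python
import random


def ten_fold_splits(n_samples, n_folds=10, val_fraction=0.2, seed=0):
    """Yield (train, val, test) index lists, one triple per fold."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Deal indices round-robin so that fold sizes differ by at most one.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        rest = [i for j, fold in enumerate(folds) if j != k for i in fold]
        rng.shuffle(rest)
        # Hold out part of the non-test data to monitor the validation loss.
        n_val = round(len(rest) * val_fraction)
        yield rest[n_val:], rest[:n_val], test


def early_stopping_epoch(val_losses, patience=15):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop, keep the best epoch.
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


# 168 CU participants contribute baseline data (Table 1).
splits = list(ten_fold_splits(168))
for fold, (train, val, test) in enumerate(splits):
    print(f"fold {fold}: train={len(train)} val={len(val)} test={len(test)}")
```

Across the ten folds every participant index appears in exactly one test set, and within each fold the training, validation, and test indices are disjoint.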


For each model, we computed SHAP (SHapley Additive exPlanation) values for all relevant 
neuroimaging features to quantify their contribution to age prediction. Figure 2 shows the top 10 
most important features for each model based on mean SHAP value. 

Figure 2. Top 10 most important features for FW-derived (A), T1-derived (B), and FW+T1-derived (C) models. 
Boldface signifies top features involved in aging and AD, including superior longitudinal fasciculus (SLF) 
FA_FWcorr and fornix FW for the FW-derived model, and left thalamus proper and 3rd ventricle for the 
T1-derived model. 


2.5. Statistical analyses 


All statistical analyses were conducted in Python (version 3.9.13) and R (version 4.1.2). We first 
performed simple linear regression between actual age and each predicted age to assess model 
performance, as well as independent-groups t-tests to compare the mean actual age and mean 
predicted brain ages of CU and MCI participants. We also conducted logistic regression analyses, 
using actual age and each predicted brain age as a direct predictor of diagnostic category, then 
evaluated model performance using area under the receiver operating characteristic curve (ROC-AUC) 
and DeLong’s test. Next, we conducted a series of linear models and competitive model analyses to 
assess actual and predicted brain age associations with cognition. All models covaried for diagnosis, 
race/ethnicity, sex, education, and APOE-ε4 positivity. Significance was set a priori at α=0.05. For 
baseline cognition, actual age and predicted brain ages (FW, T1, FW+T1) were included in a general 
linear model individually to determine their main effects on baseline memory and executive 
function. We then introduced age-by-diagnosis interaction terms to the linear models to investigate 
whether the effect of age on baseline memory and executive function scores differed by diagnosis. 
Finally, we conducted post-hoc competitive model analysis to determine the unique variance in 
baseline memory and executive function contributed by FW, T1, and FW+T1 predicted brain age, 
beyond that contributed by covariates and actual age. The described analyses were repeated for 
longitudinal cognition (longitudinal memory slope, longitudinal executive function slope). 
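
For the diagnostic comparison, ROC-AUC can be read as the probability that a randomly chosen MCI participant has a higher predictor value than a randomly chosen CU participant. The sketch below computes it in that rank-based form. It is an illustration only: the ages and labels are made-up toy values, the function name is hypothetical, and DeLong's test for comparing two AUCs is not implemented. Because a single-predictor logistic model is monotone in its predictor, ranking by the raw predictor gives the same AUC as ranking by the fitted probabilities whenever the fitted coefficient is positive.

```python
def roc_auc(scores, labels):
    """ROC-AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy predicted brain ages (hypothetical); label 1 = MCI, 0 = CU.
toy_brain_age = [68, 71, 74, 70, 73, 79]
toy_diagnosis = [0, 0, 0, 1, 1, 1]

auc = roc_auc(toy_brain_age, toy_diagnosis)
print(f"toy ROC-AUC: {auc:.3f}")
```

In the toy data, six of the nine MCI/CU pairs are ordered correctly, so the AUC is 6/9.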


3. Results 


Participant characteristics of the VMAP cohort are shown in Table 1. There were no significant 
differences in longitudinal follow-up interval, age at baseline, sex, or race between diagnostic 
groups (CU, MCI). The CU group had more years of education and lower APOE-ε4 positivity than 
the MCI group. 


3.1. Combined model using free-water (FW) and T1 features showed best performance 


Figure 3 shows the agreement between predicted brain age measures (FW, T1, FW+T1) and actual 
age; model performance was characterized using average mean absolute error (MAE_avg) and average 
root mean squared error (RMSE_avg) across folds and Pearson’s correlation through ten-fold 
cross-validation. While all predicted brain ages significantly predicted actual age (FW: MAE_avg=0.115, 
RMSE_avg=0.129, r=0.66, p=1.62x10°*; T1: MAE_avg=0.106, RMSE_avg=0.114, r=0.61, 
p=1.45x10°), the combined FW+T1 model yielded the best performance with the highest r as well 
as the lowest MAE_avg and RMSE_avg (MAE_avg=0.072, RMSE_avg=0.087, r=0.77, p=6.48x 10°"). 

Figure 3. Bland-Altman plots for FW-derived age (A), T1-derived age (B), and FW+T1-derived age 
(C). All models accurately predict age. FW+T1-derived age is most significantly associated with actual 
age, in comparison with FW-derived age and T1-derived age. 

We then compared means of actual age and predicted brain ages between CU and MCI 
participants. While there was no difference in actual age between CU and MCI groups 
(age_CU=73.07±7.24, age_MCI=72.83±6.92, p=0.792), all predicted brain ages for the MCI group were 
significantly higher than those for the CU group (FW: age_CU=72.08±5.55, age_MCI=74.18±6.16, 
p=0.006; T1: age_CU=67.52±4.96, age_MCI=68.82±5.27, p=0.048), with the combined FW+T1 model 
showing the largest difference (age_CU=71.74±5.58, age_MCI=73.93±5.67, p=0.003). 

Figure 4 shows the receiver operating characteristic curves for actual and predicted brain ages 
in predicting diagnostic category (CU, MCI). Pairwise comparisons revealed that ROC-AUC values 
for all predicted brain ages were significantly greater than that of actual age (FW-actual: p=0.003; 
T1-actual: p=0.030; FW+T1-actual: p=0.004); however, no differences were found between the 
predicted brain ages (all p>0.05). 

Figure 4. Receiver operating characteristic curves for actual, FW-predicted, T1-predicted, and 
FW+T1-predicted age in predicting diagnostic category. All predicted ages performed significantly 
better than actual age, but no difference in performance was found between predicted ages. 


3.2. Predicted brain age association with baseline cognition 


Actual age and predicted brain age (FW-derived, T1-derived, FW+T1-derived) associations with 
cross-sectional cognition (memory, executive function) are shown in Figure 5. While all models 
significantly predicted memory score at baseline (Actual: R²_adj=0.497, p=1.23x10**; FW: 
R²_adj=0.481, p=4.14x 10°; T1: R²_adj=0.481, p=4.26x 10°), the combined FW+T1 model showed the 
most robust performance (R²_adj=0.513, p=2.31x107°). Similarly, all models significantly predicted 
executive function score at baseline (Actual: R²_adj=0.472, p=3.22x10?; FW: R²_adj=0.445, 
p=1.24x10”; T1: R²_adj=0.422, p=1.69x10°7) and the combined FW+T1 model was the most robust 
(R²_adj=0.519, p=5.81x10°). When examining main effect associations of each respective age 
variable, we saw that actual age and all predicted brain ages each had a significant main effect on 
baseline memory score (Figure 5A; Actual: B=-1.162, p=1.58x10°; FW: B=-1.094, p=6.32x107; 
T1: B=-1.331, p=6.52x10-’), with the combined FW+T1 predicted brain age showing the strongest 
relationship (B=-1.476, p=2.53x10"!°). Likewise, we saw significant age effects for actual and all 
predicted ages on baseline executive function score (Figure 5B; Actual: B=-1.371, p=2.98x10°; 
FW: B=-1.276, p=1.46x10°; T1: B=-1.337, p=2.52x107), with the combined FW+T1 predicted brain 
age showing the strongest relationship (B=-1.850, p=3.85x10 1). We found no significant 
interactions between actual or predicted brain ages and diagnostic status on baseline memory or 
executive function. 


Figure 5. Actual and predicted age associations with baseline cognition. Actual and all derived ages 
are significantly associated with baseline memory score (A) and executive function performance (B); 
FW+T1-derived age shows the highest associations. Datapoint colors: green=CU; orange=MCI. 


Table 2 summarizes results of the competitive model analysis on cross-sectional cognition. We
found that covariates alone explained approximately 43% of the variance in baseline memory score
(R²adj=42.60%) and the addition of actual age led to an increase in overall model performance
(ΔR²adj=6.92%). We then iteratively added predicted brain ages to this model to determine whether
FW, T1, or FW+T1 predicted brain age contributed to any unique variance beyond covariates and
actual age. While FW and T1 predicted brain ages were not significant contributors to
baseline memory score, we observed that the combined FW+T1 predicted brain age significantly
added to the model and led to increased R²adj (FW+T1: ΔR²adj=1.74%). Similarly, covariates alone
explained approximately 36% of the variance in baseline executive function score (R²adj=35.70%)
and the addition of actual age led to a drastic increase in model performance (ΔR²adj=11.58%). When
iteratively adding each predicted brain age to the model to determine its unique contribution beyond
covariates and actual age, we observed that both FW and FW+T1 predicted brain age explained
additional variance in baseline executive function score, with FW predicted brain age leading to a
small increase in R²adj (ΔR²adj=0.63%) and FW+T1 predicted brain age leading to a large increase


Table 2. Comparison of Actual and Predicted Age Main Effects on Baseline Cognition


                            Baseline Memory Score                        Baseline Executive Function Score
                            B       SE     t       p       ΔR²adj (%)    B       SE     t       p       ΔR²adj (%)
Covariates +
  Actual age                -1.162  0.199  -5.852  <0.001  6.915         -1.371  0.186  -7.355  <0.001  11.576
Covariates + actual age +
Predicted age
  FW                        -0.474  0.287  -1.652  0.100   0.358         -0.531  0.269  -1.978  0.049   0.630
  T1                        -0.643  0.327  -1.964  0.051   0.590         -0.376  0.308  -1.219  0.224   0.106
  FW+T1                     -1.132  0.365  -3.105  0.002   1.743         -1.628  0.333  -4.892  <0.001  4.585


in R²adj (ΔR²adj=4.59%). However, T1 predicted brain age did not provide a significant increase to
the model.
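
The competitive model comparison described above can be sketched in a few lines. The example below is a minimal illustration on synthetic data, not the authors' analysis code: the variable names (`education`, `actual_age`, `predicted_age`, `memory`), the effect sizes, and the single covariate are all assumptions made for the sketch. It fits nested ordinary least squares models and reports the change in adjusted R² as each age variable is added.

```python
import random


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def adjusted_r2(predictors, y):
    """Fit OLS with an intercept and return adjusted R^2.

    predictors: list of columns, each a list of floats the same length as y.
    """
    n, p = len(y), len(predictors)
    design = [[1.0] + [col[i] for col in predictors] for i in range(n)]
    k = p + 1
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)  # normal equations
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Synthetic cohort (hypothetical values, for illustration only).
rng = random.Random(0)
n = 300
education = [rng.gauss(16, 2) for _ in range(n)]
actual_age = [rng.uniform(60, 90) for _ in range(n)]
predicted_age = [a + rng.gauss(0, 5) for a in actual_age]
memory = [
    0.10 * e - 0.03 * a - 0.05 * p + rng.gauss(0, 0.5)
    for e, a, p in zip(education, actual_age, predicted_age)
]

r2_cov = adjusted_r2([education], memory)
r2_age = adjusted_r2([education, actual_age], memory)
r2_full = adjusted_r2([education, actual_age, predicted_age], memory)
delta_actual = r2_age - r2_cov      # gain from adding actual age
delta_predicted = r2_full - r2_age  # unique gain from predicted brain age
print(f"covariates: {r2_cov:.3f}, +actual age: {delta_actual:.3f}, "
      f"+predicted age: {delta_predicted:.3f}")
```

Because adjusted R² penalizes each added predictor, a positive change indicates that the new age variable explains variance beyond what the smaller model already captures.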


3.3. Predicted brain age association with longitudinal cognition 


Actual age and predicted brain age associations with longitudinal cognition are shown in Figure
6. While all models significantly predicted longitudinal memory slope (Actual: R²adj=0.427,
p=5.08x10°78; FW: R²adj=0.439, p=4.90x10; T1: R²adj=0.412, p=1.16x107°), the combined
FW+T1 model showed the most robust performance (R²adj=0.444, p=1.50x10°). Similarly, all
models significantly predicted longitudinal executive function slope (Actual: R²adj=0.424,
p=9.20x1078; FW: R²adj=0.404, p=5.89x107°; T1: R²adj=0.420, p=2.38x10?7), and the combined
FW+T1 model was the most robust (R²adj=0.446, p=1.13x10 °). When examining the age effect, we
saw that actual age and all predicted brain ages each had a significant main effect on longitudinal
memory slope (Figure 6A; Actual: B=-0.082, p=5.30x10^-10; FW: B=-0.091, p=4.62x10^-11; T1:
B=-0.097, p=1.40x10^-8), with the combined FW+T1 model showing the strongest relationship
(B=-0.101, p=1.35x10^-11). Likewise, we saw significant main effects for actual and all predicted brain
ages on longitudinal executive function slope (Figure 6B; Actual: B=-0.128, p=1.58x10"'7; FW:
B=-0.125, p=1.20x10"!; T1: B=-0.163, p=4.25x10"!), with the combined FW+T1 model showing the
strongest relationship (B=-0.158, p=1.65x10"'*). We found no significant interactions between actual
age or predicted brain ages and diagnostic status on longitudinal memory or executive function.
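
As a rough illustration of what a longitudinal slope summarizes, the sketch below estimates a per-participant rate of change by regressing scores on time since baseline. This is an assumption for illustration only: this section does not state how the study derived its slopes (they may come from mixed-effects models), and the participant IDs and scores are hypothetical.

```python
def ols_slope(times, scores):
    """Least-squares slope of scores regressed on times (score units per time unit)."""
    n = len(times)
    if n < 2:
        raise ValueError("need at least two visits to estimate a slope")
    mean_t = sum(times) / n
    mean_s = sum(scores) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("visits must occur at distinct times")
    sxy = sum((t - mean_t) * (s - mean_s) for t, s in zip(times, scores))
    return sxy / sxx


# Hypothetical visits: (years since baseline, memory composite score).
visits = {
    "participant_01": ([0.0, 1.0, 2.0, 3.0], [0.50, 0.45, 0.35, 0.30]),
    "participant_02": ([0.0, 1.5, 3.0], [0.10, 0.10, 0.10]),
}
slopes = {pid: ols_slope(t, s) for pid, (t, s) in visits.items()}
for pid, slope in slopes.items():
    print(pid, round(slope, 3))
```

A negative slope indicates decline over follow-up; these per-participant slopes are the kind of outcome that the age variables are regressed against in the analyses above.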

Table 3 summarizes results of the competitive model analysis on longitudinal cognition. We
found that covariates alone explained approximately 33% of the variance in longitudinal memory
slope (R²adj=33.10%) and the addition of actual age led to an increase in overall model performance
(ΔR²adj=9.68%). We then added predicted brain ages to this model one at a time to determine
whether FW, T1, or FW+T1 predicted brain age contributed to any unique variance beyond
covariates and actual age. We observed that all predicted brain ages were significant contributors to
longitudinal memory slope and led to increases in R²adj (FW: ΔR²adj=2.36%; T1: ΔR²adj=1.17%;
FW+T1: ΔR²adj=2.01%). Similarly, covariates alone explained approximately 30% of the variance
in longitudinal executive function slope (R²adj=29.50%) and the addition of actual age led to a drastic
increase in model performance (ΔR²adj=12.99%). When iteratively adding each predicted brain age
to the model to determine its unique contribution beyond covariates and actual age, we observed


[Figure 6, panel A: Actual and Predicted Age Associations with Longitudinal Memory Score. x-axes:
Actual Age at Baseline; FW Predicted Age at Baseline; T1 Predicted Age at Baseline. Panel p-values as
printed: p=5.297e-10, p=4.622e-11, p=1.403e-08, p=1.350e-11. Panel B: Actual and Predicted Age
Associations with Longitudinal Executive Function Score.]

Figure 6. Actual and predicted age associations with longitudinal cognition. Actual and all derived
ages are significantly associated with longitudinal memory (A) and executive function performance (B);
FW+T1-derived age shows the strongest associations.


Table 3. Comparison of Actual and Predicted Age Main Effects on Longitudinal Cognition


                            Longitudinal Memory Slope                    Longitudinal Executive Function Slope
                            B       SE     t       p       ΔR²adj (%)    B       SE     t       p       ΔR²adj (%)
Covariates +
  Actual Age                -0.082  0.013  -6.474  <0.001  9.681         -0.128  0.017  -7.458  <0.001  12.989
Covariates + Actual Age +
Predicted Age
  FW                        -0.060  0.018  -3.372  0.001   2.362         -0.060  0.025  -2.441  0.015   1.160
  T1                        -0.057  0.016  -3.558  <0.001  1.174         -0.097  0.028  -3.482  0.001   2.539
  FW+T1                     -0.073  0.023  -3.128  0.002   2.014         -0.112  0.031  -3.564  <0.001  2.665


that all predicted brain ages explained additional variance in longitudinal executive function slope
and led to increases in R²adj (FW: ΔR²adj=1.16%; T1: ΔR²adj=2.54%; FW+T1: ΔR²adj=2.67%).


4. Discussion


The present study created 3 densely connected neural network models to predict brain age using 
FW, T1, and combined FW+T1 neuroimaging features, respectively. We evaluated model 
performance by comparing actual age with FW, T1, and FW+T1 predicted brain age then 
investigated the relationships between different age variables with cross-sectional and longitudinal 
cognitive performance (memory, executive function). Specifically, we examined age effects on 
baseline and longitudinal memory and executive function performance and conducted post hoc 
competitive model analyses to determine the unique contribution provided by each predicted brain 
age to variance in cognitive function. We report 3 main findings. First, we found that predicted brain 
ages from all 3 deep learning models using different sets of neuroimaging features (FW, T1, 
FW+T1) were highly associated with actual age; top neuroimaging features shown in model SHAP
plots (Figure 2) were also biologically relevant to aging and cognitive decline, such as superior
longitudinal fasciculus (SLF) FW-corrected FA and fornix FW in the FW model and thalamus and 3rd
ventricle in the T1 model. Second, we found that all predicted brain ages differentiated CU from
MCI participants and significantly predicted both cross-sectional and longitudinal cognitive 
performance. Finally, we found that, among all 3 models, FW+T1 predicted brain age was the 
strongest predictor of cross-sectional and longitudinal cognitive performance and contributed the 
largest unique variance in these outcome variables. 


4.1. Densely connected neural network robustly predicts age using neuroimaging features 


We found that predicted brain ages generated by a densely connected neural network using 3 distinct 
sets of neuroimaging features (FW-corrected dMRI, T1-weighted MRI, combined FW+T1) all
showed high correlation with actual age in baseline CU participants, which confirms findings from 
previous literature that have accurately predicted chronological age of healthy adults using 
neuroimaging-derived measures with machine learning approaches including deep learning !!!”1726- 
°° Importantly, the top-contributing neuroimaging features identified for each model (Figure 2) 
provide biological interpretability as they include brain regions that have been associated with both 
normal aging and AD neuropathology. For instance, previous evidence has shown that thalamic 
volume, the most important feature identified in the T1 model, decreases with advancing age*” 
independently from total brain volume loss and correlates with cognitive speed and verbal memory 
performance!” Similarly, the identification of 3rd ventricle volume as the second most important
feature in the T1 model is consistent with prior literature which demonstrated that ventricular 
expansion is associated with normal aging and expands at an accelerated rate in individuals with 
cognitive impairment (MCI, AD)**4 or AD-related pathology*>. Among top features identified for 
the FW model is the SLF, which is a white matter tract projecting from the occipital, parietal, and temporal
lobes to the frontal cortex and is involved in language, attention, and memory*®. Specifically, 
conventional FA within the SLF has been shown to undergo stable decline between ages 30-65 and 
accelerated decline after age 65%. Likewise, integrity of the fornix — a limbic white matter tract 
projecting from the hippocampus**— has been shown to decline with normal aging? and to predict 
episodic memory”? and executive function performance‘! in both healthy older adults and 
individuals with neurological disorders. Most existing literature on brain age prediction using 
machine learning techniques has leveraged T1-weighted MRI measures or conventional dMRI 
metrics. One significant advance in the present study is that we developed models using both
T1-weighted and FW-corrected diffusion MRI data, and our results suggest that multi-modal MRI
models may more accurately quantify brain age. 


4.2. Predicted age is a more sensitive measure than actual age and predictive of cognition


We found that FW, T1, and FW+T1 predicted brain ages all differentiated CU from MCI patients 
by providing a significantly higher brain age for MCI patients even though the two groups did not 
differ in actual age, suggesting that predicted brain age may be a sensitive biomarker to AD clinical 
staging. This is consistent with previous research which computed predicted age difference (i.e., 
predicted brain age minus chronological age) from T1-weighted MRI scans and found
significantly larger predicted age difference in amnestic MCI participants compared with healthy 
controls'®. Moreover, individuals with a higher predicted brain age at baseline were more likely to 
convert from MCI to AD* or develop dementia later in life'*. Studies generating predicted age 
difference from structural MRI scans of healthy controls have also found correlations with 
performance on traditional screening tools for AD (e.g., Mini-Mental State Examination, Clinical
Dementia Rating), anatomical measurements such as cortical thickness and hippocampal volume”,
AD neuropathology such as β-amyloid positivity!**°, and AD risk factors such as APOE-ε4 carrier
status!©6.
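
The predicted age difference described above is a simple per-participant subtraction, and a group comparison can be run directly on it. The following is a minimal sketch with hypothetical values (not data from this study); the function names are illustrative, and Welch's t statistic is used here only as one reasonable choice for comparing two groups.

```python
from math import sqrt
from statistics import mean, variance


def predicted_age_difference(predicted_ages, chronological_ages):
    """Predicted brain age minus chronological age, per participant (years)."""
    return [p - c for p, c in zip(predicted_ages, chronological_ages)]


def welch_t(group_a, group_b):
    """Welch's t statistic for a difference in means with unequal variances."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


# Hypothetical values for illustration only.
cu_gap = predicted_age_difference([70.0, 72.5, 68.0, 75.0], [70.0, 72.0, 69.0, 74.0])
mci_gap = predicted_age_difference([76.0, 79.5, 74.0, 81.0], [71.0, 73.0, 70.0, 75.0])
t_stat = welch_t(mci_gap, cu_gap)
print(f"mean gap CU={mean(cu_gap):.2f} y, MCI={mean(mci_gap):.2f} y, Welch t={t_stat:.2f}")
```

A positive difference means the brain appears older than the participant's chronological age; in this toy example the MCI group has the larger mean difference even though the two groups have similar chronological ages.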

We also found that all predicted brain ages were robustly associated with cross-sectional and 
longitudinal cognitive function including baseline memory and executive function scores and 
longitudinal memory and executive function slopes. This agrees with prior literature that has found 
predicted age difference to be associated with memory and executive function impairment" as well 
as early signs of cognitive decline'*. However, the relationship between predicted age and both 
baseline and longitudinal cognitive function needs further clarification as one previous study found 
negative associations with psychomotor speed at baseline but no significant association with delayed 
recall performance or general cognitive status at baseline’. The present study supports predicted 
brain age as a sensitive biomarker along the AD continuum as it distinguishes between CU and MCI 
participants and is associated with memory and executive function performance at baseline and 
longitudinally. 


4.3. Application of neural networks in clinical medicine 


Deep learning algorithms, particularly neural networks, offer remarkable clinical utility by enabling 
researchers to harness complex patterns from large-scale data and consolidate this information into 
easy-to-use platforms. Prior neuroimaging studies have used deep learning methods to predict brain 
age”; however, the present study is the first to combine T1-weighted and FW-corrected
diffusion MRI data, shedding light on the potential of using multi-modal MRI to accurately predict 
brain age and use it as an endophenotype for cognitive impairment and decline, especially in the 
context of aging and AD. Importantly, our neural networks add weight to the idea that both gray and 
white matter features are important to consider in aging and AD. 


4.4. Strengths and limitations


The present study has several strengths, including a well-characterized longitudinal cohort with 
multi-modal MRI data with paired cognitive data. Regarding our neuroimaging analysis, one major 
strength is that we incorporated T1-weighted data in conjunction with FW-corrected diffusion MRI 
data, and these data were used as input to densely connected neural networks. Importantly, our
data-driven approach found that several aging-related features (e.g., fornix integrity) were some of the
highest contributing factors in our models. One limitation of this study is that it used a well- 
educated, mostly non-Hispanic white population, thus limiting our networks’ versatility. Future 
studies should incorporate more diverse populations to ensure that the neural networks are more 
generalizable. Moreover, although we have a large population with extensive longitudinal follow- 
up, one major limitation is that we only used data from a single cohort. Future studies leveraging 
multiple cohorts would drastically enhance our ability to predict brain age and likely improve its 
utility as an endophenotype for cognitive impairment and decline. 


4.5. Conclusions 


This study provided evidence that deep neural networks can be used to predict brain age, and that 
this predicted age is a strong predictor of cross-sectional cognitive impairment and future cognitive 
decline. Our findings provide evidence that using both T1-weighted and FW-corrected diffusion
MRI data improves our ability to predict brain age; thus, future studies should consider both gray 
and white matter features when building deep learning models in aging and AD. 


Data from digital health technologies (DHT), including wearable sensors like Apple Watch,
Whoop, Oura Ring, and Fitbit, are increasingly being used in biomedical research. Research and 
development of DHT-related devices, platforms, and applications is happening rapidly and with
significant private-sector involvement, with new biotech companies and large tech companies (e.g.,
Google, Apple, Amazon, Uber) investing heavily in technologies to improve human health. Many
academic institutions are building capabilities related to DHT research, often in cross-sector 
collaboration with technology companies and other organizations with the goal of generating 
clinically meaningful evidence to improve patient care, to identify users at an earlier stage of 
disease presentation, and to support health preservation and disease prevention. Large research 
consortia, cross-sector partnerships, and individual research labs are all represented in the current 
corpus of published studies. Some of the large research studies, like NIH’s All of Us Research 
Program, make data sets from wearable sensors available to the research community, while the vast 
majority of data from wearable sensors and other DHTs are held by private sector organizations and 
are not readily available to the research community. As data are unlocked from the private sector 
and made available to the academic research community, there is an opportunity to develop 
innovative analytics and methods through expanded access. This is the second year of this Session,
which solicited research that leverages digital health technologies, including wearable sensor
data; describes novel analytical methods; or addresses issues of diversity, equity, and inclusion (DEI)
in the research, the data, and the community of researchers working in this area. We particularly
encouraged submissions describing opportunities for expanding and democratizing academic 
research using data from wearable sensors and related digital health technologies. 


Keywords: digital health technologies; wearables; sensors; waveform data; time-series data; 
algorithms. 


© 2023 The Authors. Open Access chapter published by World Scientific Publishing Company 
and distributed under the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC) 4.0 


1. Background


Wearable devices and other digital health technologies (DHTs), such as smartwatches, fitness 
trackers, and smart rings, are becoming increasingly popular for tracking and monitoring a wide 
range of health and fitness metrics [1]. Figure 1 below reflects the growth of scientific 
publications with the word “wearable” with 5,713 papers published in 2022. A similar pattern is 
seen when searching for “digital health technology.” These devices can collect data on everything 
from heart rate and sleep patterns to activity levels and blood oxygen levels. In recent years, there 
has been a growing interest in using wearable devices and DHTs for health research. Wearable 
devices offer a number of advantages over traditional research methods, such as questionnaires 
and surveys. For example, wearable devices can collect data continuously and over long periods of 
time, providing researchers with a more complete picture of an individual's health and well-being. 
Additionally, wearable devices can be used to collect data in real-world settings, rather than in 
laboratory environments, which can provide more insights into how people behave in their 
everyday lives. 


Fig. 1. Number of publications with “wearable” in PubMed from 1966-2023, highlighting exponential growth
of this subject and 5,713 papers published in 2022.


Some specific examples of disease areas with active DHT and wearables research include: 

• Cardiovascular disease: Apple Watch devices have been used to study the relationship
between heart rate and physical activity levels, and to develop algorithms to predict the
risk of heart disease [2].

• Respiratory disease: Fitbit devices have been used to study the effects of different types
of air pollution on lung function, and to develop algorithms to detect early signs of asthma
attacks [3].

• Metabolic disease: Wearable devices tracking blood glucose and activity have been used
to study gestational diabetes, and to develop algorithms to predict the risk of developing
diabetes during pregnancy [4].

• Mental health: Wearable devices have been used to study the relationship between
physical activity levels and mood, and to develop algorithms to predict symptom trajectory
for bipolar disorder [5].


In addition to these specific examples, wearable devices and DHTs are also being used to study 
a wide range of other health conditions, such as cancer, infectious diseases, and chronic pain. But
there are gaps in who has access to data and devices, who is performing the research and
algorithm development, and therefore whose health outcomes the new technologies are poised to
improve. Reviews of the current landscape of DHT research studies in the National Center for
Biotechnology Information (NCBI)’s Clinical Trials database (clinicaltrials.gov), and of studies 
published by the top-20 funded private sector DHT companies, highlight several patterns and 
limitations: 

1. Small sample size: Aside from a few large studies, most of the published clinical trials
utilizing DHT have been relatively small and are largely under-powered: “Nearly half the
studies - 829, or 46.5% - had less than 100 enrollees. Only 8% had more than 1,000” [6].

2. Narrow Health Focus: The majority of published DHT studies focus on cardiometabolic 
health and mental health/wellness, while relatively little published research examines 
critical healthcare burden diseases like stroke, chronic obstructive pulmonary disease 
(COPD), and diabetes [7]. 

3. Narrow Population Focus: Of studies published by the top 20 funded DHT private-sector 
companies, the majority (72%) include only healthy volunteers, rather than high-risk 
populations with comorbid conditions [8]. The breadth and diversity of the study 
population(s), including socioeconomic, healthcare status, and racial diversity, may be the 
most critical component of building AI-based DHT algorithms. This diversity is lacking in
current published research, likely leading to biased results [9]. The “bring your own 
device” model has been used by many research studies, but this design may result in biased 
selection of participants, and therefore biased results [10]. 

4. Limited Outcome Assessments: Only 15% of published DHT studies measured clinical
effectiveness, and those did so only in relation to patient outcomes, without evaluating healthcare
cost or access to care [11]. As healthcare cost and access are two of the most pressing
needs in healthcare, it is important to expand research to examine these outcomes. 

5. Insufficient Reporting and Data Publishing: Importantly, not only is reporting in 
clinicaltrials.gov not required for observational DHT trials, there is also no public database 
for DHT data and algorithms. This complicates the ability to understand the full range of
DHT “real world evidence” (RWE)-based research, and undermines research
reproducibility and validation. The lack of a consensus DHT database also means that
DHT data curation, feature (e.g., digital biomarker) discovery, and algorithm development
are limited to those who have data, which is largely the private sector DHT companies. One
attempt to develop standardized pipelines and data repositories for digital health data, the 
Digital Health Data Repository as part of the Digital Biomarker Discovery Pipeline [12], 
developed by co-organizer Jessilyn Dunn’s lab, is still not fully funded. 

6. Bridging the Regulatory Gap and Moving to Clinical Implementation: Despite 
tremendous progress in DHT research and development, there is still a lot of work to be
done along the research, regulatory, and clinical implementation continuum. The All of Us
Research Program is uniquely situated within NIH to interact with FDA colleagues and
assist in developing regulatory standards for this new and uncharted territory. The FDA
also has a Digital Health Center of Excellence, and there is a Digital Health Consortium,
housed within the Office of the National Coordinator, for senior leaders within the federal 
government to convene across the digital health continuum. The Digital Medicine Society 
is a professional organization that has been working across sectors with the community to 
support innovation and standardization, in part via the Digital Health Measurement 
Collaborative Community (DATAcc) [13] and the Digital Health Playbook [14]. For 
clinical implementation, HumanFirst has built the Atlas precision measures platform, a
cloud-based platform with endpoints and measures being researched using DHTs across
the industry, to help pharma and clinicians decide which devices to use and how to use
them in clinical research and healthcare [15].
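
To make the sample-size concern in item 1 concrete, the sketch below approximates the statistical power of a two-arm comparison at several enrollment levels. It is an illustration under stated assumptions, not an analysis from the cited reviews: the effect size (Cohen's d = 0.3) is hypothetical, the function name is illustrative, and the calculation uses a normal approximation.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    effect_size is Cohen's d (mean difference / common standard deviation).
    Uses the normal approximation and ignores the opposite-tail term,
    which is small unless the effect is close to zero.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    noncentrality = effect_size * sqrt(n_per_group / 2)
    return NormalDist().cdf(noncentrality - z_crit)


# A study with fewer than 100 enrollees has at most about 50 per arm.
for n_per_group in (25, 50, 100, 500):
    power = two_sample_power(0.3, n_per_group)
    print(f"n per group = {n_per_group:>3}: approximate power = {power:.2f}")
```

Under these assumptions, power rises steeply with enrollment, which is one reason pooled or multi-study datasets matter for DHT research.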


The above limitations don’t begin to address potential bias in algorithm development due to a 
limited pool of researchers interacting with these data. The purpose of this Session is to provide a
forum for current research, to address issues related to Diversity, Equity and Inclusion (DEI) in terms
of the types of research and the researchers engaged, and ultimately to energize non-commercial
research in the area. Our motivating question is: how can this community work together to create
more equitable research in the digital health tech space, in a way that benefits the research
community and increases the impact of its work?


2. Relevance to biocomputing 


Computational biology approaches and algorithm development are critical enablers to the use 
of wearable devices and DHTs for biomedical research and health. Computational biologists are 
developing new methods for extracting meaningful insights from the large and complex datasets 
collected by these devices; algorithm developers are developing new algorithms to improve the
accuracy and reliability of wearable devices and DHTs. The continuous or near-continuous data
streams from DHTs are ripe for machine learning and artificial intelligence (ML/AI) research.
Algorithms developed for detecting anomalies and other biomedically related phenomena in
wearable sensor data are increasingly being incorporated into research and moving into clinical
practice and other health-adjacent applications.

Despite the many advantages of using wearable devices and DHTs for health research, there are 
also a number of challenges that need to be addressed. One challenge is that the data collected by 
these devices can be noisy and complex with significant levels of missing data, making it difficult 
to extract meaningful insights. Another challenge is that the algorithms used to analyze these data
need to be carefully validated to ensure that they are accurate and reliable. There are also many 
different devices, and the community doesn’t yet have robust standards to compare between and 
among signals and data from different devices. 
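
As a small illustration of the missing-data challenge, the sketch below summarizes gaps in a wearable time series and fills only short interior gaps by linear interpolation. The heart rate values, the function names, and the two-sample gap threshold are all hypothetical choices for the example; appropriate thresholds depend on the device and the signal.

```python
def missing_fraction(samples):
    """Fraction of samples that are missing (None)."""
    return sum(s is None for s in samples) / len(samples)


def longest_gap(samples):
    """Length of the longest run of consecutive missing samples."""
    longest = current = 0
    for s in samples:
        current = current + 1 if s is None else 0
        longest = max(longest, current)
    return longest


def interpolate_short_gaps(samples, max_gap=2):
    """Linearly fill interior gaps of at most max_gap samples; keep longer gaps as None."""
    filled = list(samples)
    i = 0
    while i < len(filled):
        if filled[i] is not None:
            i += 1
            continue
        start = i
        while i < len(filled) and filled[i] is None:
            i += 1
        gap = i - start
        # Fill only gaps bounded by observed values on both sides.
        if start > 0 and i < len(filled) and gap <= max_gap:
            left, right = filled[start - 1], filled[i]
            for k in range(gap):
                filled[start + k] = left + (right - left) * (k + 1) / (gap + 1)
    return filled


# Hypothetical minute-level heart rate (beats per minute); None marks a dropped sample.
heart_rate = [62, 64, None, 68, None, None, None, 75, 74, None]
print(missing_fraction(heart_rate), longest_gap(heart_rate))
print(interpolate_short_gaps(heart_rate))
```

Leaving long gaps unfilled keeps the analysis honest about what was actually measured, at the cost of requiring downstream methods that tolerate missing values.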

In this session, we bridge these gaps across sectors and domains to identify opportunities for 
researchers in the PSB community to contribute to the growing biomedical research leveraging 
wearables and DHT to understand and improve health. In prior years of PSB, there has been good 
representation of a variety of data types, including genomics, imaging and clinical data sets; there 
has been limited coverage of wearable sensors and digital health technologies research. Last year,
at PSB 2023, we hosted the first year of this Session [16]. We wanted to continue to support this
conversation and topic area as the field continues to grow and obstacles to academic research
remain to be overcome. Many of the other conferences that DHT computational
researchers attend are more focused on the clinical aspects and clinical trials, and not as much on the
computational biology or biomedical research aspects of DHT data analysis and algorithm
development.

The goal of this information sharing and discussion opportunity for participants and the 
community is to expand awareness and access to these data and tools, to enrich computational 
biology research, and bridge DEI gaps. The session includes a range of voices from academia, 
government, and private sector. It’s important to represent private sector voices in this discussion 
since much of the research is currently happening in tech companies developing digital health 
devices. Creating a forum for dialogue across sectors is important for bridging gaps in awareness 
and understanding, and encouraging more researchers to participate in developing computational 
methods and analysis of data from digital health tech. The papers and discussion will focus on key 
challenges facing the field, and participants are encouraged to contribute ideas to potential 
solutions and initiate lasting collaborations with researchers and communities in this area. The 
session will also provide an opportunity to discuss as a community what is needed to truly enable 
cross-sector and expanded research for digital health technologies. 


3. Session overview


The organizers will introduce the session, providing a background of the topic area, goals, and
motivation for holding the session. There will then be a series of brief talks from the authors of the
papers that were accepted for inclusion in the proceedings, followed by a keynote by Vik Kheterpal
from Care Evolution and, finally, a panel discussion with voices from academia, industry, and
government, including Q&A with attendees. The accepted papers/talks include causal data analysis
of observational wearable device data, analysis of wearable silicone wristbands for chemical
exposure, and digital biomarkers for detecting mild cognitive impairment. The talks present original
research for publication and are widely varied; their titles are listed below:


• Expanding the access of wearable silicone wristbands in community-engaged research
through best practices in data analysis and integration

• Subject Harmonization of Digital Biomarkers: Improved Detection of Mild Cognitive
Impairment from Language Markers

• Scalar-Function Causal Discovery for Generating Causal Hypotheses with Observational
Wearable Device Data

• FedBrain: Federated Training of Graph Neural Networks for Connectome-based Brain
Imaging Analysis (poster presentation only)


Following the original research talks, the keynote will be offered by Vik Kheterpal, the CEO 
and founder of Care Evolution. Vik is a nationally recognized expert in the area of healthcare 
informatics who has been focused on healthcare data exchange and interoperability for the past 11 
years. He brings to the conversation the perspective of a serial entrepreneur working across IT, 
healthcare, and research sectors, and a go-to expert on real world data, healthcare IT, product 
design and usability, business, and leadership. After the keynote, attendees will be offered an 
opportunity to recommend DHT data collections and analysis methods that will help advance 
precision medicine research. This information will be shared with groups, such as the All of Us 
Research Program, that are collecting research data for the sake of advancing precision medicine. 

The session will conclude with a panel discussion and audience Q&A. The panel will
feature speakers from industry, academia, and government; the session organizers will be joined
by the keynote speaker and paper authors for a moderated discussion and Q&A with the
participants. Session attendees are encouraged to participate in an interactive discussion on the
current research, current challenges, and opportunities to expand access to and use of
DHT/wearables data in research.


Wearable silicone wristbands are a rapidly growing exposure assessment technology that offers
researchers the ability to study previously inaccessible cohorts and has the potential to provide a
more comprehensive picture of chemical exposure within diverse communities. However, there are 
no established best practices for analyzing the data within a study or across multiple studies, thereby 
limiting impact and access of these data for larger meta-analyses. We utilize data from three studies, 
from over 600 wristbands worn by participants in New York City and Eugene, Oregon, to present a 
first-of-its-kind manuscript detailing wristband data properties. We further discuss and provide 
concrete examples of key areas and considerations in common statistical modeling methods where 
best practices must be established to enable meta-analyses and integration of data from multiple 
studies. Finally, we detail important and challenging aspects of machine learning, meta-analysis, and 
data integration that researchers will face in order to extend beyond the limited scope of individual 
studies focused on specific populations. 


Keywords: Silicone Wristbands, Wearables, Exposome, Environmental Health, Exposure Science, 
Public Health. 


© 2023 The Authors. Open Access chapter published by World Scientific Publishing Company and 

1. Introduction 


Silicone wearables as passive sampling devices have emerged as a powerful and versatile 
personalized exposure assessment tool, allowing researchers to characterize chemical exposures for 
a wide variety of organic chemicals and study the impact of exposures on human health [1]. The use 
of silicone wearables in research, especially wristbands, has grown substantially since the first 
publication in 2014 [2]. Thousands of participants from several countries on six continents have 
worn wristbands [3] and there have been over 60 peer-reviewed papers published to date [1]. Since 
wristbands are easy-to-wear, do not require in-person consultation or training, and can be 
transported at ambient temperature in the mail back to the laboratory for analysis, they are a 
convenient choice for researchers and study participants alike even in challenging scenarios like 
disasters or pandemics [4-6]. 

However, despite the growing use of wristbands in research, the majority of individual wristband 
studies are limited due to small sample size and narrow population focus. In addition, no established 
best practices for analyzing wristband data across multiple studies exist, thereby limiting the impact and 
accessibility of these data for larger meta-analyses. Dixon et al. is the only study that has taken wristbands 
from multiple studies and reported trends in chemical exposure patterns across the globe [7]. In that 
study, the authors took wristband extracts from 14 different communities on three continents and re-ran 
those extracts on the same analytical method for the presence and absence of 1530 chemicals. 
The authors identified common chemical mixtures between geographically diverse participants. Dixon 
et al. also reported that wristbands worn in Texas post-Hurricane Harvey had the highest mean 
number of chemical detections compared with the other study locations, illustrating that comparing 
wristband studies from a diverse set of communities and geographical areas can highlight 
populations with unique chemical exposure profiles and therefore unique health risk profiles. 

Re-running wristband extracts from different studies on the same analytical method as done in 
Dixon et al. [7] is not a sustainable strategy for using wristband data to better understand broad 
exposure patterns and trends. We need new data analysis strategies to combine wristband data from 
multiple studies or use meta-analysis procedures, which would increase data access and 
interoperability. The growing number of individual wristband studies can be leveraged by 
combining data across studies to uncover patterns about personal chemical exposure, which can lead 
to new human health discoveries and can be used to direct research, interventions, and policy 
resources towards communities with higher exposure burdens or unique exposure patterns. In this 
manuscript, we present key considerations for analyzing wristband data and combining data 
collected from multiple studies. We use datasets from three studies to highlight challenges 
associated with data structure and missingness and the consequences of varying analysis techniques 
and choices between studies, which are often overlooked or not addressed in individual studies. 


2. Methods 


2.1. Study design and data collection 


Our paper illustrates data analysis and integration challenges using chemical exposure data from 
616 wristbands worn by participants in two study cohorts, one in New York City and one in Eugene, 
Oregon. The New York (NY) wristband data was collected as part of an ongoing longitudinal birth 
cohort study at the Columbia Center for Children’s Environmental Health. Individuals pregnant with 
a singleton who were 18 years and older wore a wristband for 48 hours in their third trimester of 
pregnancy [8]. There are two sets of wristband data from the NY cohort that are included in this 
report: one set includes 22 wristbands from a pilot study collected between 2013 and 2015 [8] 
(referred to as “NY Pilot”) and the second set includes 168 wristbands from a larger study between 
2015 and 2019 (referred to as “NY”). We also include data from 426 wristbands worn by study 
participants in Oregon in 2017 and 2018 (referred to as “OR”). Study participants were asked to 
wear wristbands for seven consecutive days in two seasons (summer and winter), wearing a new 
wristband each day of the study. Study participants had to be 18 years or older, be diagnosed with 
mild to moderate asthma, be a current non-smoker, and live near Eugene, Oregon. 

All participants provided informed written consent in accordance with the Columbia University 
Institutional Review Board (IRB; #AAAK6753) for the NY cohort and in accordance with the 
Oregon State University IRB (#8058) for the OR cohort. 

We prepared, cleaned, and extracted all the wristbands as previously described [5]. We also 
created and analyzed several quality control samples throughout the wristband preparation, 
transport, and laboratory processing steps, which are described in Dixon et al. [5]. We analyzed the 
New York wristband extracts for 61 organic chemicals with an Agilent 7890B gas chromatograph 
(GC) paired with a 7000C triple-quadrupole mass spectrometer (MS/MS) [5]. We analyzed the OR 
wristband extracts for 94 organic chemicals using an Agilent 7890A GC interfaced with an Agilent 
5975B MS. Further analytical details can be found in Anderson et al. [9]. 


2.2. Data Processing 


We converted chemical concentrations to moles per gram wristband and applied a log 
transformation (log2 pmol/g wristband). We set the concentration value for a given chemical equal 
to NA if there was matrix interference (Section 3.1) [5]. We conducted analyses using the statistical 
software R, version 4.1.2 [10]. For each dataset, we filtered out chemicals which were not detected 
in any wristbands; this resulted in 53, 44, and 69 chemicals in the NY Pilot, NY, and OR datasets, 
respectively. We masked chemical names in the results as part of our de-identification process. 
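
As a minimal illustration of the processing steps above, the following Python sketch converts concentrations to log2 pmol/g wristband, sets matrix-interference values to NA, and drops chemicals that were not detected in any wristband. It is not the R code used in this study. The input units (ng/g wristband), the molar mass table, and the "BLOD" and "MI" flags are assumptions made for illustration.

```python
import math

def to_log2_pmol_per_g(ng_per_g, molar_mass_g_per_mol):
    """Convert ng/g wristband to log2(pmol/g wristband)."""
    # ng divided by (g/mol) gives nmol; multiplying by 1000 gives pmol.
    pmol_per_g = ng_per_g / molar_mass_g_per_mol * 1000.0
    return math.log2(pmol_per_g)

def process(records, molar_mass):
    """records: {chemical: [value per wristband]} where each value is a number
    (assumed ng/g), "BLOD" (below the limit of detection), or "MI" (matrix
    interference). Returns the processed table for chemicals detected at least once."""
    processed = {}
    for chem, values in records.items():
        # Filter out chemicals that were not detected in any wristband.
        if not any(isinstance(v, (int, float)) for v in values):
            continue
        converted = []
        for v in values:
            if v == "MI":
                converted.append(None)       # NA for matrix interference
            elif v == "BLOD":
                converted.append("BLOD")     # keep the censoring flag; nothing is imputed here
            else:
                converted.append(to_log2_pmol_per_g(v, molar_mass[chem]))
        processed[chem] = converted
    return processed

# Hypothetical example: chemB is never detected and is therefore dropped.
example = process(
    {"chemA": [256.0, "BLOD", "MI"], "chemB": ["BLOD", "BLOD", "MI"]},
    {"chemA": 250.0, "chemB": 100.0},
)
```

Keeping the below LOD flag separate from the NA used for matrix interference preserves the distinction between the two missing data mechanisms discussed in Section 3.1.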


3. Data Properties 


3.1. Types of missing and censored data 


Missing values in data can arise for a variety of reasons and are handled differently depending on 
the type of missing data. Missing data types are commonly grouped into three categories: missing 
completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). 
When data are MCAR, the probability of an observation being missing is unrelated to any other 
observed or unobserved factors. Missing values that can be completely explained by another 
observed variable or variables are MAR. When data are MNAR, the probability of an observation 
going missing is related to an unobserved variable or variables. There are two primary types of 
missing data in wristband studies: observations that are below the limit of detection (LOD) and 
observations that are impacted by matrix interference (MI). Data values below LOD arise from a 
combination of all three missing data categories. The absence of a quantifiable peak for a sample 
and chemical of interest (MAR) results in a missing annotation and accounts for the majority of the 
below LOD missing values. Much less frequently, a human error in data processing, such as deletion 
of a peak in quantification software (MCAR), or a participant’s failure to comply with study 
protocols, such as removing a wristband for part of a day so that the low levels of measured 
chemicals are not representative of true exposure, may cause data to be annotated as missing and below 
LOD. Alternatively, MI occurs when a deuterated surrogate peak, used for quantification, is masked; 
this can arise when a wristband sample contains compounds from personal care products or sweat 
(MNAR). 


3.2. Handling missing data 


A majority of statistical and machine learning methods require complete observations (i.e. no 
missing values). Therefore, to leverage these techniques effectively, researchers must decide how 
to handle missing data, especially when using chemical concentrations from wristbands in a 
multivariate manner. One solution is to filter any samples or chemicals that contain missing values. 
We calculated the percentage of chemicals with complete observations across all samples for each 
of the three studies. Then, for each study we iteratively removed the sample with the most missing 
values and recalculated the percentage of chemicals with complete data. We summarized the 
percentage of chemicals with complete data at varying numbers of wristbands. The percentage of 
chemicals completely observed across all wristbands drops to 50% with only 4, 2, and 3 wristbands 
for NY Pilot, NY, and OR, respectively. A total of 13, 16, and 21 wristbands with the fewest missing 
values result in 25% of chemicals with complete observations for NY Pilot, NY, and OR, 
respectively. When study sizes grow to more than 165 wristbands for NY and OR, only one chemical 
is observed across all wristbands. 
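
The iterative filtering described above can be sketched as follows. This is an illustrative Python version under stated assumptions, not the code used in this study: the data layout (one row per wristband, `None` for any missing value) and the function name are hypothetical, and ties for the wristband with the most missing values are broken by taking the first one.

```python
def complete_chemical_curve(data):
    """data: list of rows (one per wristband); each row lists one value per
    chemical, with None marking a missing value.
    Returns [(number of wristbands, percent of chemicals with complete data), ...]
    as the wristband with the most missing values is removed one at a time."""
    rows = [list(r) for r in data]
    n_chem = len(rows[0])
    curve = []
    while rows:
        # A chemical is complete if it is observed in every remaining wristband.
        complete = sum(all(r[j] is not None for r in rows) for j in range(n_chem))
        curve.append((len(rows), 100.0 * complete / n_chem))
        # Remove the wristband with the most missing values.
        worst = max(range(len(rows)), key=lambda i: sum(v is None for v in rows[i]))
        rows.pop(worst)
    return curve

# Toy example with four wristbands and four chemicals.
toy = [
    [1.0, 2.0, 3.0, 4.0],
    [1.0, None, 3.0, 4.0],
    [1.0, None, None, 4.0],
    [None, None, None, 4.0],
]
curve = complete_chemical_curve(toy)
```

Reading the curve from the smallest number of wristbands upward shows how quickly the set of completely observed chemicals shrinks as wristbands are added back.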

Further, in targeted analytical methods, there is high confidence in the LODs, and information 
about which chemicals are below LOD in wristband extracts is itself meaningful: it indicates what 
people are not being exposed to or are exposed to only in very small amounts [9]. Thus, given the large 
number of missing values in wristband data, data removal approaches will significantly diminish the 
size of and information in the data and may introduce bias if missing observations are MNAR. 

An alternative approach is imputation of missing observations. The most common imputation 
approach taken in wristband studies is the replacement of the below LOD missing values with a 
constant value, such as half the LOD (e.g. [8, 11-13]). Unlike many other mass spectrometry-based 
measurement fields (e.g. proteomics), a vast majority of below LOD values are due to the true 
absence of a quantifiable peak, thus half the LOD values are reasonably close to the true 
unobservable values. However, imputation using a constant value likely does not reflect the true 
values if they could be measured and can significantly affect the covariance structure of the data, 
resulting in differences in common downstream analyses, such as principal component analyses 
(e.g. [11, 14]). As an example, we ran principal component analysis via projection pursuit (PPCA) 
[15] on the NY Pilot data without imputing missing values (Fig. 1A) and ran k-means clustering 
[16] based on the first two principal component scores, setting k=4 based on the optimal number of 
clusters as determined by evaluating the silhouette score [17]. Additionally, we ran PPCA on the 
NY Pilot data where missing values below the LOD were imputed with half the LOD (Fig. 1B) with 
samples colored by the clusters assigned based on PPCA results without imputation. The percentage 
of variability explained by each component is considerably different for the two analyses, and 
although some samples clustered similarly, several samples formed much different clusters when 
PPCA was run with imputed values. For example, samples 15, 16, and 22 cluster at the top left in 
Fig. 1B and the same behavior is not observed in Fig. 1A. Further, we examined the loadings of 
each chemical on the first principal component (PC1) as seen in Fig. 1C. Large differences in 
loadings were observed with many chemicals having very little influence on scores in the non- 
imputed PPCA but having large positive loadings in the half LOD PPCA. 

Fig. 1. PC1 and PC2 loadings from PPCA for the NY Pilot data for (A) no imputation of missing values 
and (B) imputation of missing values with half the LOD. Samples are colored by cluster as determined by 
k-means clustering based on the PC1 scores from PPCA results without imputation. (C) PC1 loadings for 
each chemical when missing values were not imputed (green dots) and when missing values were imputed 
with half the LOD (red dots). 


Alternatively, missing values can be imputed using more complex algorithms. These methods 
provide the benefit of introducing variability in imputed values, unlike imputation of half the LOD. 
A few wristband studies have utilized these approaches to date (e.g. [18]) for all types of missing 
values. As an example, we applied two such imputation methods to the OR dataset, random forest 
imputation [19] and multiple imputation by chained equations (MICE) using predictive mean 
matching [20], for chemicals with no more than 40% missing observations. Fig. 2 shows a 
comparison of imputed values generated by the two 
methods for observations that were missing due to MI and being below LOD. While some imputed 
values are very close to one another for the two methods, there are many values that differ by orders 
of magnitude. Further, the imputation methods were unable to differentiate between the missing 
value mechanisms, as below LOD and MI observations overlap, nor were they able to impute values 
below the LOD in nearly all cases. In general, empirically-driven imputation methods are 
insufficient for imputation of observations below LOD as imputed values are often much larger than 
the LOD (Fig. 2), which is not consistent with the results of the chemical analysis. Even methods 
aimed at imputation of left-censored data (e.g. [21]) rely on minimum observed values in the dataset 
and produce imputed values that are still orders of magnitude larger than known LODs, as these 
algorithms have been designed 
for different application areas. On the other hand, these imputation methods use the structure of the 
data to fill in missing observations, and can be useful for resolving missing observations due to MI. 
The choice of imputation method and underlying assumptions should be carefully considered, as 
they can lead to significantly different interpretation of a dataset in downstream analyses. Further, 
no guidance exists nor have any thorough reviews been conducted to determine the threshold of 
detection rate for a chemical to be included in an analysis and imputed. Researchers have used a 
wide range of thresholds ranging from 20% [22] to 75% [23] of observations detected for a chemical 
to be included in downstream analyses. At a minimum, researchers should conduct sensitivity 
analyses to evaluate the effect of the chosen threshold on their results. 

Fig. 2. Comparison of MICE and RF imputation methods for two types of missing chemical data, data 
that is below the LOD (gray dots) and MI (blue dots), in the OR study. The dotted black vertical and 
horizontal lines represent the median LOD across chemicals. 
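
A sensitivity analysis over the detection-rate threshold, as recommended above, could start from a sketch like the one below. It is a hypothetical Python illustration: the data layout (`None` for a below LOD observation), the function names, and the particular thresholds are assumptions, chosen to span the 20% to 75% range reported in the literature.

```python
def detection_frequency(values):
    """Fraction of wristbands in which a chemical was detected (None = below LOD)."""
    return sum(v is not None for v in values) / len(values)

def retained_chemicals(data, thresholds):
    """data: {chemical: [value per wristband]}.
    Returns {threshold: sorted list of chemicals whose detection frequency
    is at least that threshold}."""
    freq = {chem: detection_frequency(vals) for chem, vals in data.items()}
    return {t: sorted(c for c, f in freq.items() if f >= t) for t in thresholds}

# Toy data: detection frequencies of 0.8, 0.5, and 0.2.
toy = {
    "chemA": [1.0] * 8 + [None] * 2,
    "chemB": [1.0] * 5 + [None] * 5,
    "chemC": [1.0] * 2 + [None] * 8,
}
kept = retained_chemicals(toy, [0.20, 0.50, 0.75])
```

The downstream analysis would then be rerun on each retained set, and the results compared across thresholds to see which conclusions depend on the cutoff.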


The final potential solution is to develop or use novel analysis techniques which are tolerant to 
missing values or that can partition sources of variability, so they are not skewed by large numbers of 
constant values near LOD. For example, in the proteomics field, a statistical analysis technique 
which combined quantitative and qualitative (presence/absence) models was developed to 
accommodate and utilize missing observation information [24]. Further research and development 
of new data analysis models and methods for wristband data is needed to specifically address the 
challenges described here for wearable wristband data. 


3.3. Distributional properties of concentrations 


Understanding the underlying distribution of chemical concentrations measured by wristbands is of 
fundamental importance to select appropriate statistical models and analyses to conduct. Fig. 3 
shows the distribution of three chemicals from the OR study, with half the LOD filled in for 
observations below LOD. Within and across chemicals, observations from wristbands vary by orders 
of magnitude, resulting in distributions heavily skewed to the right (Fig. 3A). Log transformation is 
a common technique to stabilize variances and transform skewed data distributions to approximately 
normal distributions and is commonly used in wristband study analyses (e.g. [13, 25]). On a log- 
transformed scale, chemical concentrations above LOD can be reasonably approximated by a 
normal distribution (Fig. 3B-D). However, the full distribution of log concentrations is bimodal even 
for small numbers of observations below LOD, and the distance between observations above and 
below LOD depends on the chemical. 

Fig. 3. Distribution of three chemicals from the OR study, with half the LOD filled in for observations 
below LOD: (A and B) chemH, (C) chemPB, and (D) chemCB, with chemH visualized on both the (A) 
linear scale and (B) log scale. 


4. Data Analysis Methods 


4.1. Statistical methods and summaries 


A survey of the existing literature on wristband studies, which cover a broad range of 
research applications and hypotheses, and of their data analysis methods reveals several prevalent 
statistical methods and models, including correlation, linear regression, basic hypothesis testing 
between groups, and logistic regression. However, despite a small number of statistical models and 
methods being used in the field, nuances and details of how data preprocessing and model fitting 
are carried out vary widely. This will create significant problems as the number of wristband studies 
continues to grow and the research community attempts to combine results from multiple studies 
through techniques such as meta-analysis [26]. Meta-analysis can be a powerful tool for assessing 
the consistency and generalizability of results across multiple studies and study populations. 
However, study combination approaches require consistent statistical data processing and testing 
procedures. 

Perhaps the most common summary statistic used in wristband studies is the 
correlation between chemical concentrations measured by wristbands, between a chemical’s 
concentration profile from wristbands and the profile of another exposure assessment methodology 
(e.g. urine biomarkers), or between concentrations and health outcomes. A majority of researchers 
utilize a non-parametric Spearman correlation, recognizing that the assumption that both variables 
follow a normal distribution, 
as required by metrics such as Pearson correlation, is not appropriate. However, some researchers 
have calculated Spearman correlation by imputing below LOD observations with half the LOD, or 
a similar small constant value (e.g. [8, 23, 27, 28]), while other researchers have chosen to calculate 
correlation using only observations above LOD (e.g. [29]). In the case of linear regression, chemical 
concentrations are used as the dependent variable or independent variable(s) depending on the 
research question. Some researchers choose to log-transform concentrations, impute half LOD 
values for below LOD observations, and limit chemicals used in analyses based on percentage of 
observations above LOD in an effort to avoid violation of the normality assumption of errors (e.g. 
[12, 23]). Other researchers restrict linear regression models to chemical concentrations above LOD 
[30]. Further, the choice of the minimum percentage of detections required per chemical for 
inclusion in analyses varies widely across studies. When testing for differences in concentrations 
between groups of interest, some researchers leverage non-parametric tests such as the Wilcoxon 
Test (e.g. [29]) and others conduct parametric statistics such as a t-test on log-transformed 
concentrations (e.g. [31]). 

Although differences in analyses mentioned above appear small or trivial in nature, the 
implications on results and conclusions drawn across studies using techniques such as meta-analysis 
are potentially large. For example, when testing differences in chemical concentrations between 
groups of interest, a t-test is testing for differences in the mean concentration levels, while the 
Wilcoxon Test conducts a test for differences in the distribution of values and is typically sensitive 
to differences in the median, depending on the sample size, but often not the mean unless the sample 
size is very large [32]. As an illustrative example of the downstream effect of differences in data 
treatment and modeling, we used the OR dataset. We considered two pairs of chemicals, chemUB 
& chemT and chemH & chemT, for which nearly all wristbands had pairs of observations above 
LOD (94.8% and 93.7% of 426 wristbands, respectively). Using all complete pairs of observations, 
we first calculated the Spearman correlation between the concentrations of chemUB & chemT and 
of chemH & chemT as 0.157 and 0.959, respectively. Then for each pair of 
chemicals, we synthetically introduced missing below LOD values into the data. At each iteration, 
an additional observation was set to below LOD, for 1 to 420 wristbands (0.2% to 99% of the data), 
and the Spearman correlation was calculated 1) with half the LOD imputed for missing values and 
2) ignoring missing below LOD observations. Fig. 4 shows the correlation values for the two pairs 
of chemicals. The treatment of observations below LOD causes the Spearman correlation to differ 
considerably between the two methods even with small percentages of missing values. As the 
percentage of detections decreases, correlation calculated using imputation with half LOD inflates 
and effectively becomes a metric of correspondence between detections and non-detections between 
the two chemicals rather than measuring the strength of quantitative association, even above 
thresholds of filtering seen in literature (e.g. 70% marked by the vertical dashed line in Fig. 4). When 
ignoring missing values from the Spearman correlation calculation, values are centered around the 
true correlation value but have high variability as the percentage of below LOD observations gets 
larger. Fig. 4 clearly shows that these two methods for computing correlation are representative of 
different properties of the data when not all observations are above LOD. 

Fig. 4. Spearman correlation for (A) chemUB & chemT and (B) chemH & chemT at varying levels of 
simulated missing values. Vertical lines indicate a common threshold for filtering at 70% detection, and 
horizontal lines represent the correlation with complete data. 
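
The contrast between the two treatments of below LOD observations can be reproduced on toy data with the Python sketch below. It is an illustration only and does not reproduce the simulation reported here. In particular, it assumes that the same wristbands are set to below LOD for both chemicals and that half the LOD lies below every detected value. The function names and the toy data are hypothetical.

```python
import math

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)

def compare_treatments(x, y, censored, lod_x, lod_y):
    """censored: indices of wristbands set to below LOD for both chemicals
    (an assumption of this sketch).
    Returns (rho with half the LOD imputed, rho from detected pairs only)."""
    x_imp = [lod_x / 2.0 if i in censored else v for i, v in enumerate(x)]
    y_imp = [lod_y / 2.0 if i in censored else v for i, v in enumerate(y)]
    keep = [i for i in range(len(x)) if i not in censored]
    rho_imputed = spearman(x_imp, y_imp)
    rho_detected = spearman([x[i] for i in keep], [y[i] for i in keep])
    return rho_imputed, rho_detected

# Toy data: a perfectly decreasing relationship between two chemicals.
x = [float(v) for v in range(1, 11)]
y = [float(v) for v in range(10, 0, -1)]
rho_imputed, rho_detected = compare_treatments(x, y, {0, 1, 2, 3, 4}, lod_x=0.5, lod_y=0.5)
```

In this toy example the detected pairs remain perfectly negatively associated (a correlation of -1), whereas imputing half the LOD for the five shared non-detections yields a positive correlation of about 0.72. The imputed version is driven by which wristbands share detection status, not by the quantitative association.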


4.2. Utilizing machine learning 


Machine learning (ML) offers a promising avenue for advanced multivariate analyses, such as 
discovery of associations between multiple chemicals and a particular health outcome or prediction 
of a chemical exposure level based on behavior and environmental factors. Despite the promise of 
ML models, they are most powerful in cases where large sample sizes are available. As the 
technology grows in popularity and laboratories become more established in the methodology, many 
studies such as the OR and NY datasets presented here and other studies (e.g. [33]) are reaching 
sample sizes where ML is a viable option. One limitation of ML is that a majority of methods require 
no missing values, requiring researchers to again consider and establish best practices when dealing 
with missing observations due to being below LOD, MI, etc., for wristband data. 

If a researcher uses chemical detection status (i.e. detected or below LOD) as the response 
variable in their model, it is important to note that ML classification model performance can suffer 
when outcome category frequencies are highly imbalanced [34]. In this case, when prediction accuracy 
is being optimized, ML models learn characteristics of the majority class only. The use of 
alternative metrics and techniques such as down sampling and upweighting [35] may help alleviate 
this issue if large enough sample sizes are available. When using a quantitative outcome, a large 
majority of ML methods, particularly those used on smaller sample sizes (e.g. <1,000), assume that 
the response variable approximately follows a normal distribution. Therefore, if a researcher uses 
chemical concentrations as the response variable and imputes below LOD observations, methods 
such as discriminant analysis, naive Bayes, and support vector machines will be inappropriate. Tree- 
based methods such as regression trees [36] and random forest regression [37] do not make 
distributional assumptions and provide more promise for use with wristband data. However, even 
for these models, the ratio of detects and non-detects, the distance between LOD and detected 
values, and the optimization or loss function used must be considered carefully. For example, if the 
mean-squared error is used the model can effectively become a classification model between 
detections and non-detections with no ability to differentiate large differences in concentrations, 
because they are still much smaller than the large distance between LOD and observed 
concentrations. If chemical concentrations are used as predictor variables, some ML methods also assume 
normally distributed explanatory variables (e.g. discriminant analysis). Random forest regression 
models utilize resampling with replacement to grow multiple regression trees. Sample size and 
distributional properties, such as the number of below LOD observations, therefore become 
important, as the resampling method may have a difficult time representing the underlying 
distribution well, particularly for bimodal distributions. Fig. 5 shows the original distribution of 
concentrations with half LOD imputed values for chemZ from the NY Pilot dataset. Blue densities 
show 25 resampled distributions drawn by the random forest model. Some random draws represent 
the original distribution well, while others do not sample any below LOD observations at all, 
because of the small sample size. Additional research into nuances of ML methods and 
establishment of best practices is of fundamental importance as study sizes grow and combination 
of study datasets becomes possible. 

Fig. 5. Density of log2 chemZ concentrations from NY Pilot data (red) and densities from 25 random 
forest resample draws (blue). 
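
The chance that a resample misses every below LOD observation can be estimated with a short Python sketch. It assumes the standard bootstrap, in which n observations are drawn with replacement from the n available. The number of below LOD observations used in the example (2 of 22) is hypothetical and is not the value for chemZ.

```python
import random

def prob_no_nondetects(n, k):
    """Probability that a bootstrap sample of size n, drawn with replacement from
    n observations of which k are below LOD, contains no below LOD observation."""
    return (1.0 - k / n) ** n

def simulate(n, k, draws, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(draws):
        sample = [rng.randrange(n) for _ in range(n)]
        # Indices 0..k-1 stand for the below LOD observations.
        if all(i >= k for i in sample):
            misses += 1
    return misses / draws

# Hypothetical case: 22 wristbands, 2 of them below LOD for a given chemical.
analytic = prob_no_nondetects(22, 2)
estimate = simulate(22, 2, draws=20000)
```

For a fixed number of below LOD observations, this probability approaches zero as the sample size grows, which is consistent with the resampling issue being most pronounced in small studies.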


5. Utilizing Data from Multiple Studies 


Combining data from multiple wristband studies is crucial to allow researchers to uncover patterns 
of personal chemical exposure that correlate with potential health impacts across diverse 
communities. A shared understanding of fundamental wristband data properties and a set of 
statistically sound strategies to analyze the data is needed in order to pave the way for combining 
data from multiple studies. However, there are also several other factors that hinder combining 
wristband data across studies [1, 38, 39]; some are due to research gaps and some are due to 
differences in how researchers report their data. For example, the way that authors report chemical 
concentrations varies. Some authors report chemical mass per entire wristband (e.g. ng/wristband) 
and others report chemical mass per unit mass of the wristband (e.g. ng/g wristband or pmol/g 
wristband) [1]. When these concentration reporting differences are present and wristband masses 
are not reported alongside the data, then this inhibits the comparison of chemical concentrations 
across studies [39]. In addition, more research is needed to understand the variables that influence 
the rate and amount of chemicals entering into wristbands [1], which could lead to strategies to 
normalize wristband data across studies where wristbands were worn for different lengths of time 
and in different environmental conditions. As described in Samon et al., chemical uptake into 
silicone wristbands is not consistent over time and is dependent on each chemical’s physical- 
chemical properties and environmental conditions during the study [1]. Silicone post-deployment 
cleaning and extraction methods also differ between laboratories [3, 39] and it is unknown how 
these differences may affect quantified concentrations across studies in different laboratories. To 
reduce additional sources of variability in the data, researchers need to agree on best practices for 
communicating study protocols to participants to address potential misconceptions up front. For 
example, participants may think that if they wear the wristband longer than requested, they are 
helping the research goal and providing more data, when instead they are introducing more 
variability and complicating interpretation of study results. 

While further research into sources of variability due to differences in study designs is needed, 
how chemical concentrations might be normalized across studies remains an open research question. 
In the meantime, researchers can consider alternate ways to utilize data from multiple studies. For 
example, the comparison of study locations can be made by looking for differences in detection 
frequency, if other important factors are reasonably controlled or equitable between populations. 
However, different analytical methods for chemical identification and quantification have been 
developed and used, even for studies coming out of the same research laboratory. In these cases, the 
chemicals targeted differ. For example, the NY dataset was measured for 61 chemicals, and the OR 
dataset was measured for 94 chemicals. Of these chemicals, a total of 45 were measured in both 
studies. The joining of these datasets would result in 110 chemicals in total, and more missing values 
would be introduced into the joined dataset. However, unlike previous missing data, these missing 
values would be MAR and downstream statistics would need to account for the additional 
mechanism for missing values. When comparing the exposure of individuals between study 
populations, the detection frequency across all chemicals analyzed within each study relative to a 
measure of central tendency summary across other participants in the same study would give a sense 
of total exposure level for an individual. Further, many studies have discretized chemical 
concentration values into categories such as tertiles (e.g. [5, 28]). This concept could be used within 
a given study to derive chemical tertile profiles for each wristband. Then categorical data analysis 
strategies such as multiple correspondence analysis [40] could be used to perform clustering of 
wristband samples across studies to find samples with common patterns. Alternatively, the actual 
percentile within a study could be recorded and used to visualize and begin to understand exposure 
patterns across studies. However, missing values arising from a study-based MAR source and those 
arising from an MNAR below LOD source need to be carefully considered and treated differently. For 
example, Fig. 6 shows empirical cumulative density curves for chemCB in both the NY and OR 
datasets with tertile thresholds denoted by gray dashed lines. For this chemical, the proportion of 
wristbands with values below LOD in the OR study is greater than 0.33. It would be nonsensical to 
assign some of the wristbands with non-detections to the lower tertile and others to the middle tertile. 
If all wristbands below LOD were assigned to the lower tertile for OR, a researcher would need to
consider whether tertiles composed of different proportions of samples are still comparable, and at
what proportion of below-LOD observations comparisons are no longer meaningful.

Fig. 6. Empirical cumulative density plots of chemCB with tertile thresholds denoted by gray lines.
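
The tertile issue described above can be made concrete with a small sketch. The Python below is illustrative only and is not taken from the authors' R code: the function name `tertile_labels`, the toy concentrations, and the convention of placing every below-LOD wristband in the lowest category are assumptions. It labels samples within one study by rank and shows how the group sizes become unequal once more than a third of the samples are non-detects.

```python
from collections import Counter

def tertile_labels(values):
    """Label each sample 'low', 'mid' or 'high' within one study.

    values: concentrations, with None marking a non-detect (below LOD).
    Non-detects are always labelled 'low' (assumed convention); detected
    samples get the tertile implied by their rank among all samples.
    """
    n = len(values)
    # Sort indices so that non-detects come first, then detected values ascending.
    order = sorted(range(n), key=lambda i: (values[i] is not None, values[i] or 0.0))
    labels = [None] * n
    for rank, i in enumerate(order):
        if values[i] is None or rank < n / 3:
            labels[i] = "low"
        elif rank < 2 * n / 3:
            labels[i] = "mid"
        else:
            labels[i] = "high"
    return labels

# Hypothetical studies: 1 of 9 non-detects versus 4 of 9 non-detects.
study_few_nd = [None, 0.8, 1.1, 1.9, 2.4, 3.0, 4.2, 6.1, 9.3]
study_many_nd = [None, None, None, None, 1.2, 2.0, 3.5, 5.0, 8.0]

print(Counter(tertile_labels(study_few_nd)))
print(Counter(tertile_labels(study_many_nd)))
```

In the second toy study the "low" group holds four samples and the "mid" group only two, which is the comparability question raised in the text.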


6. Recommendations and Conclusions 


The use of silicone wristbands in research studies has rapidly grown over the past few years, 
especially in community-engaged research [1, 6]. Currently, researchers are using a wide range of 
data processing and statistical approaches to analyze data from silicone wristbands with a focus on 
within-study interpretation, but these approaches jeopardize the ability to use the datasets for larger 
meta-analyses. The need for better guidance and established best practices is evident in the examples 
shown here, where minor differences in data handling and modeling can lead to vastly different 
conclusions and interpretation. Some key takeaways and guidance from example analyses presented 
here are as follows: 


• The imputation of half the LOD, or other small constant values, can greatly affect the
covariance structure of wristband data (Fig. 1). Even when features are scaled, imputation of
below-LOD values is not recommended, as analyses utilizing the covariance structure will be
determined by detection rates rather than the intended quantitative information.

• Many dimension reduction techniques for exploratory data analysis have implementations
that do not require imputation of missing values, such as projection pursuit PCA. These
algorithms are preferable to imputation of below-LOD values for wristband data.

• Wristband chemical concentrations in which below-LOD values have been imputed with
half the LOD do not follow a normal distribution, even when log-transformed, and are
bimodal. Methods such as linear regression are not appropriate for these data, and alternative
methods such as Gaussian mixture models [41] should be considered.

• Missing value treatment strategies must account for the different missing data mechanisms
in wristband studies, rather than blindly applying one imputation method to multiple
types of missing values (Fig. 2). For example, the use of data-driven imputation algorithms,
such as MICE, on below-LOD observations can result in imputed values considerably larger
than the LOD.

• Non-parametric hypothesis tests are not equivalent to parametric hypothesis tests at sample
sizes typical of wristband studies. Although methods such as the Wilcoxon test can be used
to skirt the assumption of normally distributed data with half-LOD imputed data, even at
small percentages of missingness the tests are effectively determined by detection rates
rather than concentration values (Fig. 4).
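
The first takeaway can be illustrated with a toy calculation. The sketch below is not the authors' analysis: the LOD of 1.0, the two made-up chemicals, and the helper names `pearson` and `log_half_lod` are all assumptions. Among the samples where both chemicals are detected, the log concentrations are perfectly negatively correlated; after half-LOD imputation the shared detection pattern dominates and the correlation becomes strongly positive.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def log_half_lod(values, lod):
    """Log-transform concentrations, substituting LOD/2 for non-detects (None)."""
    return [math.log(lod / 2 if v is None else v) for v in values]

LOD = 1.0  # hypothetical limit of detection
chem_a = [10.0, 20.0, 40.0, 80.0, None, None, None, None]
chem_b = [80.0, 40.0, 20.0, 10.0, None, None, None, None]

# Correlation using only samples where both chemicals were detected.
both = [(a, b) for a, b in zip(chem_a, chem_b) if a is not None and b is not None]
r_detected = pearson([math.log(a) for a, _ in both], [math.log(b) for _, b in both])

# Correlation after half-LOD imputation of the non-detects.
r_imputed = pearson(log_half_lod(chem_a, LOD), log_half_lod(chem_b, LOD))

print(round(r_detected, 2), round(r_imputed, 2))
```

The sign flip comes entirely from the block of identical imputed values, which is one way to see why covariance-based analyses end up reflecting detection rates.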


When presented with data from a wristband study or multiple wristband studies, researchers
should first prioritize evaluating the units reported, the amount of missing data, and its effect on
the structure of the specific chemicals analyzed. If data from multiple studies are present,
concentration units should
be standardized. Further, only chemicals commonly detected between studies should be considered 
in downstream analyses. Sensitivity analyses should then be conducted to determine which 
chemicals should be analyzed using quantitative concentrations or using detected/not-detected 
information based on the detection frequency of each compound. If data is to be used quantitatively, 
concentrations should be log-transformed. Tree-based ML methods, which can handle detection 
information and concentrations simultaneously through LOD imputation, may be considered with 
large enough sample sizes. Sensitivity analyses looking at the resampled distribution of chemicals 
should be examined before considering these ML methods. Finally, if the overall exposure profile 
is of interest, researchers should evaluate if techniques, such as PPCA, provide an interpretable 
reduced dimension option to represent wristband data. Often in wristband studies, chemicals will all 
have loadings in the same direction on one of the principal components leading to one potential 
metric of overall exposure. 
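
The first steps of this workflow can be sketched as follows. This is a hedged illustration rather than the authors' procedure: the cutoff `min_detect_freq`, the use of the lower of the two per-study detection frequencies, and the function names are assumptions, and the concentration units are assumed to have been standardized already. In practice the text recommends choosing the analysis mode through sensitivity analyses rather than a fixed threshold.

```python
import math

def detection_frequency(values):
    """Fraction of samples with a detected value (None marks below LOD)."""
    return sum(v is not None for v in values) / len(values)

def plan_analysis(study_a, study_b, min_detect_freq=0.5):
    """Decide how to analyze each chemical measured in both studies.

    study_a, study_b: dict mapping chemical name -> list of concentrations.
    Returns a dict mapping chemical -> 'quantitative' or 'detected/not-detected'.
    min_detect_freq is a hypothetical cutoff, not a value from the paper.
    """
    plan = {}
    for chem in sorted(set(study_a) & set(study_b)):
        freq = min(detection_frequency(study_a[chem]),
                   detection_frequency(study_b[chem]))
        plan[chem] = "quantitative" if freq >= min_detect_freq else "detected/not-detected"
    return plan

def log_detected(values):
    """Log-transform detected concentrations and leave non-detects as None."""
    return [None if v is None else math.log(v) for v in values]

# Toy data with made-up chemical names.
ny = {"chemA": [1.0, 2.0, None, 4.0], "chemB": [None, None, None, 0.5], "chemC": [3.0, 1.0, 2.0, 5.0]}
oregon = {"chemA": [2.0, 3.0, 1.5, None], "chemB": [None, 0.7, None, None], "chemD": [1.0, 1.0, 1.0, 1.0]}
print(plan_analysis(ny, oregon))
```

Chemicals measured in only one study (chemC and chemD here) are dropped before any downstream analysis, matching the recommendation to restrict attention to chemicals common to the studies.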

This is the first paper to summarize data properties, current data analysis approaches and their 
issues, and important areas where best practices are needed for wristband data. We demonstrate 
there is a need for standardized and thorough wristband data analysis methods from the research 
community, which will create more opportunities to combine wristband data from multiple studies 
or use meta-analysis procedures, leading to increased data access and interoperability. In addition, 
more research is needed to understand other factors that hinder the combination of data from 
individual wristband studies (e.g. how to normalize for differences in wristband wear time and 
environmental conditions). Overall, a combination of these efforts will enable research to move 
beyond the narrow population focus of individual studies, leading to new discoveries about personal 
chemical exposure and potential impacts to human health. 




Acknowledgements 


We thank the study participants for their willingness to engage with our research team. Research 
reported in this publication was supported by the National Institute of Environmental Health 
Sciences (NIEHS) under award numbers R21/R33ES024718, P30ES030287, and P42ES016465. 
The content is solely the responsibility of the authors and does not necessarily represent the official 
views of the NIEHS. PNNL is a multi-program laboratory operated by Battelle for the U.S. 
Department of Energy under contract DE-AC05-76RL01830.


Data and Code Availability 


• De-identified wristband data is available for download at
https://data.pnnl.gov/group/nodes/dataset/33672.

• R code to reproduce all analyses and plots is on GitHub at
https://github.com/PNNL-Superfund-Research-Center/PSB_Wristband_Analyses/.


Declaration of Competing Interest 


Kim A. Anderson and Diana Rohlman, authors of this research, disclose a financial interest in 
MyExposome, Inc., which is marketing products related to the research being reported. The terms 
of this arrangement have been reviewed and approved by Oregon State University in accordance 
with its policy on research conflicts of interest. The authors have no other relevant financial or non- 
financial interests to disclose. 


References 


1.Samon SM, Hammel SC, Stapleton HM, Anderson KA. Silicone wristbands as personal passive 
sampling devices: current knowledge, recommendations for use, and future directions. Environ Int. 
2022: 107339. 


2.O'Connell SG, Kincl LD, Anderson KA. Silicone wristbands as personal passive samplers.
Environ Sci Technol. 2014; 48(6): 3327-35. 


3.Dixon HM, Poutasse CM, Anderson KA. Silicone wristbands and wearables to assess chemical 
exposures. In: Phillips K, Yamamoto D, Racz L, editors. Total exposure health: An introduction. 
First ed: CRC Press; 2020. p. 139-60. 


4.Anderson KA, Points III GL, Donald CE, Dixon HM, Scott RP, Wilson G, et al. Preparation and
performance features of wristband samplers and considerations for chemical exposure assessment. 
J Expo Sci Environ Epidemiol. 2017; 27: 551. 


5.Dixon HM, Bramer LM, Scott RP, Calero L, Holmes D, Gibson EA, et al. Evaluating predictive 
relationships between wristbands and urine for assessment of personal PAH exposure. Environ Int. 
2022; 163: 107226. 


6.Rohlman D, Samon S, Allan S, Barton M, Dixon H, Ghetu C, et al. Designing equitable, 
transparent community-engaged disaster research. Citizen science: theory and practice. 2022; 7(1). 




7.Dixon HM, Armstrong G, Barton M, Bergmann AJ, Bondy M, Halbleib ML, et al. Discovery of 
common chemical exposures across three continents using silicone wristbands. Royal Society Open 
Science. 2019; 6(2): 181836. 


8.Dixon HM, Scott RP, Holmes D, Calero L, Kincl LD, Waters KM, et al. Silicone wristbands 
compared with traditional polycyclic aromatic hydrocarbon exposure assessment methods. Anal 
Bioanal Chem. 2018; 410(13): 3059-71. 


9.Anderson KA, Szelewski MJ, Wilson G, Quimby BD, Hoffman PD. Modified ion source triple 
quadrupole mass spectrometer gas chromatograph for polycyclic aromatic hydrocarbon analyses. J 
Chromatogr A. 2015; 1419: 89-98. 


10.R Development Core Team. R: A language and environment for statistical computing. Vienna, 
Austria: R Foundation for Statistical Computing; 2023. 


11.Donald CE, Scott RP, Blaustein KL, Halbleib ML, Sarr M, Jepson PC, et al. Silicone wristbands 
detect individuals' pesticide exposures in West Africa. Royal Society Open Science. 2016; 3(8): 
160433. 


12.Hammel SC, Hoffman K, Webster TF, Anderson KA, Stapleton HM. Measuring personal 
exposure to organophosphate flame retardants using silicone wristbands and hand wipes. Environ 
Sci Technol. 2016; 50(8): 4483-91. 


13.Kassotis CD, Herkert NJ, Hammel SC, Hoffman K, Xia Q, Kullman SW, et al. Thyroid receptor 
antagonism of chemicals extracted from personal silicone wristbands within a papillary thyroid 
cancer pilot study. Environ Sci Technol. 2020; 54(23): 15296-312. 


14.Paulik LB, Hobbie KA, Rohlman D, Smith BW, Scott RP, Kincl L, et al. Environmental and 
individual PAH exposures near rural natural gas extraction. Environ Pollut. 2018; 241: 397-405. 


15.Webb-Robertson B-JM, Matzke MM, Metz TO, McDermott JE, Walker H, Rodland KD, et al. 
Sequential projection pursuit principal component analysis—dealing with missing data associated 
with new-omics technologies. BioTechniques. 2013; 54(3): 165-8. 


16.Hartigan JA, Wong MA. Algorithm AS 136: A k-means clustering algorithm. Journal of the 
royal statistical society series c (applied statistics). 1979; 28(1): 100-8. 


17.Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. 
Journal of computational and applied mathematics. 1987; 20: 53-65. 


18.Doherty BT, McRitchie SL, Pathmasiri WW, Stewart DA, Kirchner D, Anderson KA, et al. 
Chemical exposures assessed via silicone wristbands and endogenous plasma metabolomics during 
pregnancy. J Expo Sci Environ Epidemiol. 2022; 32(2): 259-67. 


19.Stekhoven DJ, Bühlmann P. MissForest—non-parametric missing value imputation for mixed- 
type data. Bioinformatics. 2012; 28(1): 112-8. 




20.Van Buuren S, Groothuis-Oudshoorn K. mice: Multivariate imputation by chained equations in
R. Journal of statistical software. 2011; 45: 1-67. 


21.Lazar C, Burger T. imputeLCMD: a collection of methods for left-censored missing data 
imputation. R package, version. 2022; 2.1. 


22.Poutasse CM, Haddock CK, Poston WSC, Jahnke SA, Tidwell LG, Bonner EM, et al. Firefighter 
exposures to potential endocrine disrupting chemicals measured by military-style silicone dog tags. 
Environ Int. 2022; 158: 106914. 


23.Wang S, Romanak KA, Hendryx M, Salamova A, Venier M. Association between thyroid 
function and exposures to brominated and organophosphate flame retardants in rural central 
appalachia. Environ Sci Technol. 2019; 54(1): 325-34. 


24.Webb-Robertson B-JM, McCue LA, Waters KM, Matzke MM, Jacobs JM, Metz TO, et al. 
Combined statistical analyses of peptide intensities and peptide occurrences improves identification 
of significant peptides from MS-based proteomics data. J Proteome Res. 2010; 9(11): 5748-56. 


25.Reddam A, Tait G, Herkert N, Hammel SC, Stapleton HM, Volz DC. Longer commutes are 
associated with increased human exposure to tris (1, 3-dichloro-2-propyl) phosphate. Environ Int. 
2020; 136: 105499. 


26.Rothman KJ, Greenland S, Lash TL. Modern epidemiology: Wolters Kluwer Health/Lippincott 
Williams & Wilkins Philadelphia; 2008. 


27.Levasseur JL, Hoffman K, Herkert NJ, Cooper E, Hay D, Stapleton HM. Characterizing 
firefighter's exposure to over 130 SVOCs using silicone wristbands: A pilot study comparing on- 
duty and off-duty exposures. Sci Total Environ. 2022; 834: 155237. 


28.Hammel SC, Phillips A, Hoffman K, Stapleton HM. Evaluating the use of silicone wristbands to 
measure personal exposure to brominated flame retardants. Environ Sci Technol. 2018. 


29.Romano ME, Gallagher L, Doherty BT, Yeum D, Lee S, Takazawa M, et al. Inter-method 
reliability of silicone exposome wristbands and urinary biomarker assays in a pregnancy cohort. 
Environ Res. 2022; 214: 113981. 


30.Bergmann AJ, North PE, Vasquez L, Bello H, Ruiz MdCG, Anderson KA. Multi-class chemical 
exposure in rural Peru using silicone wristbands. J Expo Sci Environ Epidemiol. 2017; 27(6): 560- 
8. 


31.Quintana PJ, Hoh E, Dodder NG, Matt GE, Zakarian JM, Anderson KA, et al. Nicotine levels in 
silicone wristband samplers worn by children exposed to secondhand smoke and electronic cigarette 
vapor are highly correlated with child’s urinary cotinine. J Expo Sci Environ Epidemiol. 2019; 
29(6): 733-41. 


32.Divine G, Norton HJ, Hunt R, Dienemann J. A review of analysis and sample size calculation 
considerations for Wilcoxon tests. Anesth Analg. 2013; 117(3): 699-710. 




33.Samon S, Rohlman D, Tidwell L, Hoffman P, Oluyomi A, Walker C, et al. Determinants of 
exposure to endocrine disruptors following hurricane Harvey. Environ Res. 2023; 217: 114867. 


34.Japkowicz N, Stephen S. The class imbalance problem: A systematic study. Intelligent data 
analysis. 2002; 6(5): 429-49. 


35.Susan S, Kumar A. The balancing trick: Optimized sampling of imbalanced datasets—A brief 
survey of the recent State of the Art. Engineering Reports. 2021; 3(4): e12298. 


36.Breiman L. Classification and regression trees. First ed: Routledge; 1984.


37.Breiman L. Random forests. Machine learning. 2001; 45: 5-32.


38.O’Connell SG, Anderson KA, Epstein MI. Determining chemical air equivalency using silicone
personal monitors. J Expo Sci Environ Epidemiol. 2022; 32(2): 268-79. 


39.Wacławik M, Rodzaj W, Wielgomas B. Silicone wristbands in exposure assessment: analytical
considerations and comparison with other approaches. Int J Env Res Public Health. 2022; 19(4): 
1935. 


40.Le Roux B, Rouanet H. Geometric data analysis: from correspondence analysis to structured data 
analysis: Springer Science & Business Media; 2004. 


41.Reynolds DA. Gaussian mixture models. Encyclopedia of biometrics. 2009; 741(659-663). 




Subject Harmonization of Digital Biomarkers: Improved Detection of Mild 
Cognitive Impairment from Language Markers 


Bao Hoang¹‡, Yijiang Pang¹‡, Hiroko H. Dodge², Jiayu Zhou¹†


¹ Department of Computer Science and Engineering, Michigan State University, East Lansing, MI
48824, USA
² Department of Neurology, Massachusetts General Hospital, Harvard Medical School, Boston,
MA 02129, USA
† Corresponding E-mail: jiayuz@msu.edu
‡ Equal contribution


Mild cognitive impairment (MCI) represents the early stage of dementia, including
Alzheimer’s disease (AD), and is a crucial stage for therapeutic interventions and treatment.
Early detection of MCI offers opportunities for early intervention and significantly
benefits cohort enrichment for clinical trials. Imaging markers and in vivo biomarkers in plasma
and cerebrospinal fluid have high detection performance, yet their prohibitive costs
and intrusiveness demand more affordable and accessible alternatives. Recent advances
in digital biomarkers, especially language markers, have shown great potential: variables
informative of MCI are derived from linguistic and/or speech data and later used for predictive
modeling. A major challenge in modeling language markers comes from the variability
in how each person speaks. As the cohort size for language studies is usually small due
to extensive data collection efforts, the variability among persons makes language markers
hard to generalize to unseen subjects. In this paper, we propose a novel subject harmonization
tool to address the issue of distributional differences in language markers across
subjects, thus enhancing the generalization performance of machine learning models. Our
empirical results show that machine learning models built on our harmonized features have
improved prediction performance on unseen data. The source code and experiment scripts
are available at https://github.com/illidanlab/subject_harmonization


Keywords: Mild Cognitive Impairment; Harmonization Algorithm 


1. Introduction 


Alzheimer’s disease (AD) is a major type of dementia and ranked as the seventh-leading cause
of death in the United States in 2020 [1]. Mild Cognitive Impairment (MCI) is the prodromal
stage of dementia, including AD, characterized by minor problems with memory, language, or
judgment. Early detection of MCI is critical for early intervention and cohort enrichment. In
vivo biomarkers such as Aβ-amyloid, identified by cerebrospinal fluid Aβ42 or PET amyloid
imaging, are sensitive to the early or pre-clinical stage. Yet they are neither easily accessible
nor affordable for mass screening of older adults in general, especially those with limited
healthcare access.

© 2023 The Authors. Open Access chapter published by World Scientific Publishing Company and
distributed under the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC)
4.0 License.

Recently developed digital biomarkers have offered an affordable and non-intrusive alternative.
In particular, language markers, i.e., linguistic and speech variables derived from conversations,
whether structured or semi-structured, have shown a significant correlation with the
cognitive capability of the subjects and have recently been used for MCI detection. Digital
biomarkers are generally derived and utilized in a data-driven fashion. For example, language markers
are derived from carefully designed cohorts to build predictive models that take language
features as input and clinical variables as output.

One significant challenge of digital biomarkers is the limited cohort sample size, as
specially designed collection protocols and devices must be deployed for data collection. For
example, in studies of language markers, the I-CONECT study collected semi-structured
conversation data from 74 subjects in a five-year clinical trial, and the ADReSS data from
DementiaBank has spontaneous speech of 158 subjects. As the small sample size greatly
limits the machine learning models that can be used for analysis, a standard way to enrich the
sample size is to construct multiple data points from the same subject, each associated with
the subject's clinical label as the prediction target. In sensor studies, for example,
multiple time series are derived from the same subject as data points by using a fixed time
window. Another example is in language marker studies, where linguistic and speech markers
are derived from one conversation, and thus multiple conversations from the same subject are
treated as different data points.

Even though these treatments greatly increase the sample size for predictive modeling,
they violate a basic assumption of most analytic approaches: that data points are
independent and identically distributed (i.i.d.). The non-i.i.d. issue is compounded by another
challenge of digital biomarkers, which usually have high individual variability compared to
other biomarkers, leading to unstable prediction performance and poor generalization
to unseen subjects. Taking language markers again as an example: the way people
speak can differ drastically, and such differences are much more pronounced than the subtle
differences characterizing cognitive capabilities. The intuitive idea is to harmonize the
distributional bias from subjects, similar to the harmonization that removes confounding factors
from demographic data or eliminates batch effects. However, subject harmonization differs
drastically from eliminating typical confounding variables: the subjects in the
testing/inference stage are not accessible during training, and the embedding of subject
information is implicit and may be non-linearly correlated with multiple dimensions in the
original feature representations. Therefore, existing harmonization approaches cannot be
used to quantify and remove the subject effects.

In this paper, we propose a novel framework for subject harmonization. The proposed 
approach uses an auxiliary classification task on the subjects to learn a deep harmonization 
network, which eliminates both linear and non-linear effects in differentiating subjects. Our 
empirical results show that the language markers harmonized by the proposed approach can 
improve MCI detection performance. 


2. Related Works 


Detection of MCI. Many approaches have been developed for detecting MCI using a combination
of clinical information, brain imaging, and genetics. For example, machine
learning models built on brain imaging such as MRI and FDG PET have been shown to be effective
for capturing structural and metabolic information of the brain and are strongly associated
with the development of AD. Yet these biomarkers are often expensive and intrusive,
making them hard to use for screening older adults in general. More recently, digital biomarkers
have offered a promising, affordable, and non-intrusive alternative for broader adoption. The
development of language markers is still in its early stage. Digital markers derived from
behavior are highly variable, and language markers derived from limited data often
yield unstable detection models that are hard to generalize to unseen populations.

Data Harmonization. A fundamental challenge of data analysis is the harmonization of
confounding variables, i.e., eliminating the effects of confounding variables. With explicit
confounding variables, common harmonization approaches eliminate the confounding variables’
influence on the original input features or output. Recent deep learning models require
the harmonization of non-linear effects, leading to the development of end-to-end frameworks
that combine the task prediction loss with a penalty loss that usually minimizes dependence
between confounders and prediction outcomes. Meanwhile, fair machine learning
schemes exploit distributionally robust optimization to control implicit demographic
confounding effects (bias). From another aspect, the underlying variables can be considered a
strong signal in the original features that is irrelevant to the prediction goal, in which case
feature engineering helps reduce the effects. Most existing harmonization approaches need
confounding variables to be accessible during training in order to secure generalization to unseen
groups. However, in digital biomarker studies where subjects are treated as a confounding variable,
a challenge arises because testing subjects are not seen during training, which demands a
harmonization that generalizes across subjects.


3. Methods 
3.1. Data 


We use semi-structured conversational data from a clinical trial, I-CONECT
(NCT02871921). The data is available upon request. This clinical
trial aims to investigate the potential benefits of regular video chat conversations on the
cognitive functions and psychological well-being of individuals aged 75 and older. The dataset
has 6771 conversation sessions from 74 participants, with 36 participants being cognitively
normal (NL) and 38 diagnosed with mild cognitive impairment (MCI). Each conversational
session is about 30 minutes in length. Table 1 shows the participants’ demographic information.


Table 1. Demographics of Participants


Variable                   All (n = 74)   NL (n = 36)    MCI (n = 38)
Age                        80.7 ± 4.6     79.7 ± 3.9     81.7 ± 5.0
Gender (% women)           71.6           77.8           65.8
Years of education         15.2 ± 2.5     15.4 ± 2.5     15.1 ± 2.5
Number of Conversations    91.5 ± 37.2    92.4 ± 35.8    90.7 ± 38.4


3.2. Language Markers


We derived a total of 99 feature variables for each conversation as language markers, including
four types: Linguistic Inquiry and Word Count (LIWC), Syntactic Complexity, Lexical
Diversity, and Response Length.

Linguistic Inquiry and Word Count (LIWC): For the LIWC feature variables, we use the
2007 English version of Linguistic Inquiry and Word Count. This tool categorizes English
words into 64 different “LIWC categories”. These categories cover a wide range of linguistic,
psychological, and topical aspects, enabling us to gain insights into various social, cognitive,
and affective processes. To obtain the LIWC features, we follow prior work: we first generate a
64-dimensional LIWC feature vector for every word in each conversation, with each dimension
corresponding to a specific LIWC category (1 = word belongs to the category, 0 = word does
not belong); we then sum the feature vectors of all words in the conversation, resulting in
a single 64-dimensional feature vector representing the linguistic features of that conversation.
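
The per-word indicator vectors and their sum can be sketched in a few lines. This is a minimal illustration under stated assumptions: the three-category lexicon below is a made-up stand-in for the 64-category LIWC 2007 dictionary (which is licensed and also matches word stems, something this sketch ignores), and `liwc_features` is a hypothetical helper name.

```python
def liwc_features(words, lexicon, categories):
    """Sum per-word category indicator vectors over one conversation.

    words: list of tokens from the conversation.
    lexicon: dict mapping a lowercase word to the categories it belongs to.
    categories: ordered list of category names (64 in the real dictionary).
    Returns one count per category, in the order given by `categories`.
    """
    index = {cat: j for j, cat in enumerate(categories)}
    totals = [0] * len(categories)
    for word in words:
        # Each matched category contributes a 1 in that word's indicator vector.
        for cat in lexicon.get(word.lower(), ()):
            totals[index[cat]] += 1
    return totals

# Toy stand-in for the dictionary; entries are illustrative only.
categories = ["posemo", "social", "cogmech"]
lexicon = {"happy": ["posemo"], "friend": ["social"], "think": ["cogmech"]}

tokens = "I think my friend is happy and I think so".split()
print(liwc_features(tokens, lexicon, categories))
```

Because the vectors are summed rather than averaged, longer conversations produce larger counts, so the feature scale depends on conversation length.
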

Syntactic Complexity: Syntactic complexity represents the range and intricacy of grammatical
structures employed in language production. We used the L2 Syntactic Complexity Analyzer to
extract the syntactic complexity features. This tool is specifically designed to automate the
analysis of syntactic complexity in English-language texts produced by advanced learners of
English. We extract a 23-dimensional vector from each conversation representing its syntactic
complexity, with each dimension corresponding to a specific English syntactic complexity
measure from the tool.

Lexical Diversity: Lexical diversity is the range of different words within a given text, wherein
a wider range indicates greater diversity. Given a text input, lexical diversity has been measured
using the type-token ratio (TTR), obtained by dividing the number of unique words by the overall
word count. To adopt this in our study, we extract the TTR from participants’ conversational
responses, as well as its variations, such as the moving average type-token ratio (MATTR)
and the mean segmental type-token ratio (MSTTR). We also use additional lexical diversity
measures, including the hypergeometric distribution D (HD-D) and the measure of textual
lexical diversity (MTLD). In total, we derive a 10-dimensional vector representing each
conversation’s lexical diversity, with each dimension corresponding to one of the aforementioned
lexical diversity measures or its variations.
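
The three ratio-based measures can be written down directly from their definitions. The sketch below is illustrative: the window and segment length of 50 tokens are assumed defaults rather than settings reported in the paper, tokenization is plain whitespace splitting, and HD-D and MTLD are omitted.

```python
def ttr(tokens):
    """Type-token ratio: unique tokens divided by total tokens (tokens must be non-empty)."""
    return len(set(tokens)) / len(tokens)

def mattr(tokens, window=50):
    """Moving average TTR: mean TTR over every sliding window of `window` tokens."""
    if len(tokens) <= window:
        return ttr(tokens)
    scores = [ttr(tokens[i:i + window]) for i in range(len(tokens) - window + 1)]
    return sum(scores) / len(scores)

def msttr(tokens, segment=50):
    """Mean segmental TTR: mean TTR over consecutive, non-overlapping full segments."""
    full = [tokens[i:i + segment] for i in range(0, len(tokens) - segment + 1, segment)]
    if not full:
        return ttr(tokens)
    return sum(ttr(s) for s in full) / len(full)

tokens = "a b a c a b d d".split()
print(ttr(tokens), mattr(tokens, window=4), msttr(tokens, segment=4))
```

Plain TTR tends to fall as a text gets longer, which is the usual motivation for the windowed variants: MATTR and MSTTR compare texts at a fixed window length.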

Response Length: Our analysis suggests that NL individuals tend to provide lengthier responses
to questions posed by the interviewer than MCI individuals, showing great potential for
distinguishing between MCI and NL individuals. We extract the mean and variance of participants’
response lengths within each conversation.


3.3. Generalized Least Squares 


Generalized least squares is a widely used harmonization approach to remove linear effects
given confounding variables, such as age, gender, and education. For each conversation’s
extracted language marker features $x_i$, we assume that these features are linearly
biased by three confounding variables of the subject (age, sex, and education), denoted
$c_i = [\text{age}, \text{sex}, \text{education}]$, such that:


$$x_i = w \cdot c_i^{\top} + x_i^{\text{harmonized}},$$


where $w$ is the weight matrix and $x_i^{\text{harmonized}}$ is our goal, the harmonized language
markers. The objective function for the generalized least squares method is given by:


$$\min_{w} \sum_{i=1}^{n} \left( w \cdot c_i^{\top} - x_i \right)^2 .$$


After obtaining the weight matrix $w$ by solving the above objective function, the harmonized
language markers are derived by:


$$x_i^{\text{harmonized}} = x_i - w \cdot c_i^{\top} .$$
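
As a worked step, the objective above has a closed-form minimizer. The derivation below is a sketch under assumptions not stated in the text: the confounder vectors are stacked as the columns of a matrix $C \in \mathbb{R}^{3 \times n}$, the feature vectors as the columns of $X \in \mathbb{R}^{d \times n}$ (here $d = 99$), the squared term is read as a squared Euclidean norm, and $C C^{\top}$ is invertible. Since the objective as written weights all conversations equally, it coincides with ordinary least squares.

```latex
\begin{aligned}
J(w) &= \sum_{i=1}^{n} \lVert w\,C_{:,i} - X_{:,i} \rVert_2^2 = \lVert wC - X \rVert_F^2, \\
\nabla_w J(w) &= 2\,(wC - X)\,C^{\top} = 0
  \;\Longrightarrow\; \hat{w} = X C^{\top} \left( C C^{\top} \right)^{-1}, \\
X^{\text{harmonized}} &= X - \hat{w}\,C
  = X \left( I_n - C^{\top} \left( C C^{\top} \right)^{-1} C \right).
\end{aligned}
```

The matrix in the last line is the orthogonal projector onto the complement of the row space of $C$, so the harmonized features carry no component that is linearly explained by the confounders.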


3.4. Subject Harmonization for Non-linear Predictive Modeling 


Unlike other types of in vivo biomarkers, digital markers show great individual variability. In
language markers, for instance, how people speak a language can differ greatly, even if they are
all native speakers. The differences can be visualized by checking the distributions of language
features. Our empirical results in Sec. 4.1 show that the feature variables have clear clustering
structures w.r.t. subjects. As such, successful analysis and predictive modeling need careful
harmonization to eliminate individual variability. The harmonization mechanism of generalized
least squares eliminates the linear subspace that is predictive of the confounding variables
and uses the orthogonal complement subspace as the harmonized features. Though all linear
effects are removed by this approach, it does not remove any
non-linear effects from the data. For example, if the product of two confounding variables
(e.g., age and gender) has an effect on the data, that effect will not be removed and will be
picked up by non-linear models such as random forests and deep learning models. Another
challenge comes from the generalization of harmonization: digital biomarkers demand
a harmonization procedure that can be generalized to unseen subjects.

To address the above challenges, we propose a deep harmonization network to facilitate
analytics with digital biomarkers. In the context of predicting MCI from language markers,
we are given a set of conversations collected from a set of different subjects, and we would like
to build a predictive model for MCI using these conversations. We follow Sec. 3.2 to
extract features and form a feature vector for each conversation. The
predictive modeling task is to classify each conversation/feature vector into a label (MCI
or not); these labels are later aggregated into a prediction for the subject. The feature vectors of
one subject are used either in training or in testing, but not both. The goal of harmonization
is to remove the confounding factor of subjects from the feature vectors. The proposed approach
has two stages: in the first stage, we construct an auxiliary task to learn the deep harmonization
network; in the second, the learned harmonization network is used to transform the data
points, and the harmonized data is then used for building a downstream classifier of MCI.
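
The requirement that a subject's conversations appear in training or in testing, but never both, amounts to splitting by subject rather than by conversation. A minimal sketch follows; the function name `split_by_subject`, the 20% test fraction, and the seed are illustrative assumptions and do not describe the paper's evaluation protocol.

```python
import random

def split_by_subject(subject_ids, test_fraction=0.2, seed=0):
    """Split conversation indices so that no subject appears in both sets.

    subject_ids: one subject identifier per conversation.
    Returns (train_idx, test_idx), two lists of conversation indices.
    """
    subjects = sorted(set(subject_ids))
    rng = random.Random(seed)
    rng.shuffle(subjects)
    # Hold out whole subjects, at least one.
    n_test = max(1, round(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])
    train_idx = [i for i, s in enumerate(subject_ids) if s not in test_subjects]
    test_idx = [i for i, s in enumerate(subject_ids) if s in test_subjects]
    return train_idx, test_idx

# Toy example: 12 conversations from 5 hypothetical subjects.
ids = ["s1"] * 3 + ["s2"] * 2 + ["s3"] * 4 + ["s4"] * 1 + ["s5"] * 2
train_idx, test_idx = split_by_subject(ids)
print(len(train_idx), len(test_idx))
```

Splitting at the conversation level instead would let a model recognize test subjects it has already seen, which would overstate generalization to unseen subjects.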

The design of a deep harmonization network is based on two intuitions: 1) a good har- 
monization should remove all linear and non-linear effects from subjects, and therefore the 
harmonized features should not be able to differentiate subjects under deep models; 2) the 
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harmonized features should be as close to the original feature as possible (otherwise, the 
harmonization admits a trivial solution where all features are wiped and set to the same 
value). Following these intuitions, the proposed approach seeks to minimize the subject
differentiation between data points obtained from different subjects while also minimizing the
differences between harmonized and original language markers. Formally, for $M$ pairs of extracted
language features and corresponding subject labels $(x_i, y_i^s)$, we denote by
$f_{FH}(\cdot): x \to x$ the feature harmonization network parameterized by $\theta_{FH}$, and by
$f_s(\cdot): x \to s$ the auxiliary subject classifier parameterized by $\theta_s$. The composite
function $f_s \circ f_{FH}$ denotes the classifier $f_s$ applied to the harmonized features
$f_{FH}(x)$. The objective for learning feature harmonization is given by:


$$\min \; \frac{1}{M} \sum_{i=1}^{M} \Big[ -\ell_{ent}\big(f_s \circ f_{FH}(x_i),\, y_i^s\big) + \ell_{mse}\big(f_{FH}(x_i),\, x_i\big) \Big], \quad (1)$$

where $\ell_{ent}(\cdot)$ is the cross-entropy loss, so that minimizing $-\ell_{ent}(\cdot)$ encourages
harmonized features that cannot be differentiated by subject identity, and $\ell_{mse}(\cdot)$ is the
mean squared error, which encourages similarity between the original and the harmonized features.
Note that we do not restrict the type of classifier used for $f_s$, but a non-linear model is
preferred given the design of deep harmonization. In our study, we use a 3-layer MLP for the
harmonization network.
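
To make the two terms of Equation (1) concrete, the sketch below evaluates the objective for a batch whose network outputs are already available. It is an illustration under assumptions: the function names are hypothetical, the subject classifier is assumed to output softmax logits, and no training loop or network is included.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one softmax prediction against an integer class label."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[label]


def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)


def harmonization_objective(subject_logits, subject_labels, harmonized, original):
    """Batch average of -l_ent + l_mse, the quantity minimized in Equation (1).

    subject_logits[i]: subject-classifier logits for harmonized sample i
    subject_labels[i]: integer subject label of sample i
    harmonized[i], original[i]: feature vectors after and before harmonization
    """
    total = 0.0
    for logits, label, h, x in zip(subject_logits, subject_labels, harmonized, original):
        total += -cross_entropy(logits, label) + mse(h, x)
    return total / len(subject_labels)
```

The first term rewards features from which the subject cannot be recovered, while the second penalizes drifting away from the original markers, which rules out the trivial constant solution mentioned above.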


3.5. MCI Detection using Harmonized Features 


After the harmonization process, we use the harmonized features with confounding effects re- 
moved for the downstream task of MCI detection. The MCI detection can be modeled by two 
classification tasks: a) conversation classification that identifies whether a given conversation 
is from an MCI subject or an NL subject using language markers extracted from the conversation,
and b) subject classification, which aggregates the conversation-level predictions
for one subject to predict whether that subject is an MCI subject
or an NL subject. We model conversation classification as a standard machine learning task 
that seeks a classifier that takes language markers as input and outputs a binary prediction.
Formally, we have $M$ pairs of extracted features and corresponding cognitive status labels
$(x_i, y_i^c)$. We denote by $f_t(\cdot): x \to t$ the MCI classifier parameterized by $\theta_t$. In our study, we
use two classifiers: a linear model (logistic regression, LR) and a non-linear model (2-layer 
multi-layer perceptron, MLP). Then, the objective function for cognitive status classification 
is formulated as: 


$$\min_{\theta_t} \; \frac{1}{M} \sum_{i=1}^{M} \ell\big(f_t \circ f_{FH}(x_i),\, y_i^c\big),$$

where $\ell(\cdot)$ is the binary cross-entropy loss. To achieve subject classification, we use a majority
vote strategy so that if more than 50% of a subject’s conversations are predicted as MCI 
by the conversation classifier, we classify that subject as MCI and NL otherwise. For both 
settings, we randomly sample 80% of the subjects as training subjects and use the remaining
subjects as test subjects. The conversations from training subjects are used to train the
conversation classifier. The complete framework is illustrated in Figure 1.
[Figure 1 panels. Stage I: Train Feature Harmonization Network. Stage II: MCI Detection after Harmonization.]
Fig. 1. The proposed subject harmonization process includes two stages. In the first stage, we
train a deep harmonization network using an auxiliary subject classification task, which discourages 
differentiation among subjects and meanwhile retains the similarity between the original features and 
the harmonized ones. In the second stage, we fix the harmonization model and use the harmonized 
features to train the main learning task, i.e., the detection of MCI. 
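
The majority-vote rule used for subject classification in Section 3.5 can be sketched as follows. The function name and data layout are hypothetical; predictions are assumed to be coded as 1 for MCI and 0 for NL.

```python
from collections import defaultdict


def classify_subjects(conversation_preds, subject_ids):
    """Label a subject MCI (1) if more than 50% of its conversations are predicted MCI.

    conversation_preds[i] is the 0/1 prediction for conversation i and
    subject_ids[i] is the subject that conversation belongs to.
    """
    votes = defaultdict(list)
    for pred, sid in zip(conversation_preds, subject_ids):
        votes[sid].append(pred)
    # Strictly more than half is required, so an exact tie is labeled NL (0).
    return {sid: int(sum(p) > 0.5 * len(p)) for sid, p in votes.items()}
```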


4. Experimental Results and Analysis 
4.1. Effectiveness and Generalizability of Subject Harmonization 


Harmonization is designed to remove the confounding effect of the subject variable.
Therefore, we investigate the power of the features to predict subjects before and
after harmonization. The stronger the confounding variable, the better the features
can differentiate subjects. A successful harmonization should largely eliminate such
prediction power.

In this experiment, conversations from individual subjects are assigned the same labels, 
while conversations from different subjects are assigned distinct labels. For example, all con- 
versations from the first subject have the label 1, and all those from the second subject have 
the label 2. With a total of 74 subjects, we have 74 unique labels. We randomly split data 
(original or harmonized) into training and testing, with 80% of conversations for training and 
20% for testing. We build a linear classifier (Logistic Regression) and a deep classifier
(Multi-layer Perceptron) using the training data and evaluate the performance in terms of accuracy
on the testing data. For the harmonization network, we use a 3-layer Multi-layer Perceptron. We
repeat the experiment for 100 random seeds and report the average accuracy of predicting
testing conversations’ subject labels before and after harmonization in Table 2. We use the
same training and testing conversations for each random seed while evaluating before and af- 
ter harmonization. We see a substantial decrease in subject classification performance in both 
models, showing the effectiveness of the harmonization design that removes the confounding 
variables’ linear and non-linear effects. 

We conduct a qualitative study that visualizes the distributions of the language markers 
before and after the subject harmonization in Figure 2. We use t-SNE [22] to plot the
99-dimensional language markers in a comprehensible 2-dimensional space, where conversations
from the same subjects are assigned matching colors. From the visualization, we see that data 
points from the same subjects show a clear clustering structure of subjects, indicating subject 
Table 2. Performance of subject classification tasks before and after subject harmonization
(accuracy, mean ± standard deviation).

Classifier               Before harmonization   After harmonization
Logistic Regression      0.921±0.007            0.221±0.012
Multi-layer Perceptron   0.905±0.007            0.219±0.038
[Figure 2 panels. Before Harmonization; After Harmonization.]
Fig. 2. The visualization of language markers extracted from conversations collected from 10
randomly selected subjects before and after subject harmonization. We see that a clear clustering
structure exists before subject harmonization, which is successfully destroyed by the harmonization.


bias in the language markers. After the harmonization, this clustered structure is visually
destroyed, showing the effectiveness of the proposed harmonization strategy.


4.2. MCI Detection via Harmonized Language Markers 


We now investigate the predictive power of language markers in detecting MCI subjects. 
We compare a set of different harmonization approaches: a) generalized least squares PBS] 
commonly used for harmonizing linear effects, here with age/gender/education as confounding
variables; b) the proposed deep subject harmonization, which harmonizes against the subject
variable but does not use demographic variables (age/gender/education); c) deep harmonization
that does not use subject information and jointly harmonizes all demographic variables; and
d) deep harmonization approaches that harmonize only individual demographic variables.

When harmonizing demographic variables using a deep harmonization network, we construct
categorical variables from age/gender/education (e.g., age between 75-79 as category 1,
age between 80-84 as category 2) and train the network using Equation (1), with these categories
in place of the subject labels. We repeat the experiments for 100 random seeds and report the
average and standard deviation of the area under the ROC curve (AUC), F1, sensitivity, and
specificity on the test data in Table 3.

From the results, we find the following: 1) The non-linear model MLP using features 
from deep subject harmonization, which harmonizes the subject variable using a deep model, 
provides the best downstream classification performance on both conversation and subject 
predictions. 2) Both the linear and non-linear models benefit more from deep subject har- 
monization than generalized least squares. 3) For MLP, deep harmonization on demographic 
Table 3. Performance of two cognitive status classification tasks over different harmonization methods
(task performance, mean ± standard deviation).

Method for harmonization                                  Classifier  AUC          F1           Sensitivity  Specificity

Conversation classification
None                                                      LR          0.583±0.098  0.557±0.092  0.570±0.123  0.557±0.101
None                                                      MLP         0.594±0.092  0.556±0.088  0.545±0.116  0.611±0.091
Generalized least squares                                 LR          0.567±0.110  0.537±0.104  0.538±0.134  0.570±0.119
Generalized least squares                                 MLP         0.545±0.109  0.522±0.103  0.516±0.132  0.574±0.125
Deep harmonization - subject (Proposed method)            LR          0.640±0.097  0.581±0.089  0.575±0.129  0.625±0.132
Deep harmonization - subject (Proposed method)            MLP         0.646±0.092  0.558±0.101  0.541±0.136  0.640±0.126
Deep harmonization (- age & gender & education year)      MLP         0.527±0.120  0.517±0.119  0.593±0.227  0.427±0.235
Deep harmonization - age                                  MLP         0.596±0.107  0.538±0.101  0.535±0.166  0.608±0.178
Deep harmonization - gender                               MLP         0.554±0.110  0.551±0.110  0.635±0.209  0.426±0.208
Deep harmonization - education year                       MLP         0.611±0.102  0.589±0.080  0.654±0.141  0.477±0.165

Subject classification
None                                                      LR          0.591±0.124  0.579±0.126  0.593±0.166  0.568±0.169
None                                                      MLP         0.626±0.122  0.593±0.124  0.576±0.153  0.649±0.159
Generalized least squares                                 LR          0.585±0.129  0.529±0.148  0.519±0.187  0.601±0.164
Generalized least squares                                 MLP         0.568±0.122  0.568±0.138  0.565±0.175  0.605±0.175
Deep harmonization - subject (Proposed method)            LR          0.649±0.121  0.592±0.115  0.575±0.157  0.652±0.162
Deep harmonization - subject (Proposed method)            MLP         0.657±0.113  0.571±0.118  0.546±0.152  0.655±0.152
Deep harmonization (- age & gender & education year)      MLP         0.538±0.148  0.539±0.165  0.637±0.272  0.381±0.282
Deep harmonization - age                                  MLP         0.614±0.122  0.577±0.133  0.585±0.205  0.603±0.217
Deep harmonization - gender                               MLP         0.571±0.128  0.579±0.139  0.676±0.230  0.409±0.244
Deep harmonization - education year                       MLP         0.638±0.122  0.632±0.091  0.736±0.159  0.417±0.218

Abbreviations: LR, Logistic Regression; MLP, Multi-layer Perceptron.


variables performs worse than generalized least squares, even though both jointly harmonize 
against all three demographic variables. 


4.3. Performance on Different Sub-Populations


Table 4 presents the performance of conversation and subject classification on different
sub-populations, i.e., different gender groups, education levels, and age groups. By zooming in
on the performance of different sub-population groups, we want to inspect how the proposed
subject harmonization impacts these groups, given that demographic variables are not used in
the harmonization process. From the results, we see that the proposed subject harmonization
improved the AUC of most groups, with the exception of 1) the most highly
educated group (Edu years 19-21), for both conversation and subject classification, and 2)
a minor performance drop in the Male group for subject classification.
Table 4. Performance of two cognitive status classification tasks before and after the harmonization
(performance comparison by group, mean ± standard deviation).

            Before harmonization                     After harmonization
Groups      AUC          Sensitivity  Specificity    AUC          Sensitivity  Specificity

Conversation classification
Male        0.538±0.185  0.517±0.199  0.537±0.213    0.564±0.199  0.656±0.228  0.409±0.257
Female      0.621±0.112  0.554±0.140  0.641±0.103    0.673±0.106  0.475±0.181  0.728±0.143
Edu 12-15   0.483±0.162  0.529±0.182  0.447±0.165    0.618±0.186  0.586±0.205  0.599±0.234
Edu 16-18   0.621±0.163  0.490±0.185  0.715±0.110    0.668±0.146  0.452±0.241  0.735±0.160
Edu 19-21   0.857±0.096  0.790±0.182  0.743±0.161    0.732±0.323  0.647±0.418  0.498±0.257
Age 75-80   0.608±0.123  0.532±0.158  0.648±0.096    0.638±0.111  0.519±0.169  0.664±0.130
Age 81-87   0.500±0.231  0.483±0.224  0.529±0.317    0.517±0.309  0.456±0.263  0.512±0.369
Age 88-94   0.781±0.189  0.918±0.157  0.339±0.129    0.941±0.058  0.987±0.026  0.386±0.293

Subject classification
Male        0.589±0.275  0.537±0.299  0.587±0.365    0.577±0.277  0.641±0.292  0.384±0.392
Female      0.653±0.152  0.600±0.184  0.665±0.187    0.691±0.158  0.491±0.204  0.751±0.184
Edu 12-15   0.480±0.211  0.530±0.226  0.377±0.291    0.624±0.215  0.601±0.218  0.603±0.247
Edu 16-18   0.694±0.241  0.549±0.327  0.828±0.199    0.699±0.228  0.445±0.295  0.756±0.221
Edu 19-21   1.000±0.000  0.929±0.258  0.921±0.260    0.754±0.395  0.607±0.457  0.508±0.445
Age 75-80   0.654±0.176  0.561±0.206  0.715±0.185    0.671±0.153  0.512±0.212  0.699±0.158
Age 81-87   0.515±0.309  0.501±0.331  0.569±0.431    0.541±0.379  0.474±0.320  0.536±0.444
Age 88-94   0.953±0.192  0.984±0.087  0.141±0.336    0.984±0.087  1.000±0.000  0.328±0.426


4.4. Important Language Markers Before and After Harmonization 


In this section, we investigate the feature importance and compare the top language markers 
before and after harmonization. For linear models, feature importance can be directly derived
from the model weights, whereas the non-linear MLP models used in this paper offer no
such straightforward approach. We therefore adopt the commonly used permutation feature
importance. We permute the values of one feature at a time and feed the modified dataset
into our pipeline. We then compute the AUC score
for both conversation and subject classification on this permuted dataset. The feature
importance of a feature is then determined by computing the difference between the AUC val- 
ues obtained from the original dataset and the permuted dataset. A larger decrease in AUC 
indicates higher importance of the respective feature in the classification model. 
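
The permutation procedure described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `predict` stands for any already trained scoring function, rows are plain Python lists of feature values, and the helper names are hypothetical.

```python
import random


def auc(scores, labels):
    """Rank-based AUC: chance that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_importance(predict, rows, labels, feature, seed=0):
    """AUC drop after shuffling one feature column across rows.

    predict: hypothetical trained model mapping a feature row to a score
    feature: index of the column to permute
    """
    base = auc([predict(r) for r in rows], labels)
    column = [r[feature] for r in rows]
    random.Random(seed).shuffle(column)
    # Rebuild each row with the shuffled value in place of the original one.
    permuted = [r[:feature] + [v] + r[feature + 1:] for r, v in zip(rows, column)]
    return base - auc([predict(r) for r in permuted], labels)
```

A feature that the model ignores yields a drop of exactly zero, since shuffling it leaves every score unchanged; averaging over several seeds would reduce the noise of a single shuffle.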

In Table 5, we present the top 10 language features before and after the feature harmonization
for both conversation and subject classification. We see that: 1) the top features differ
considerably before and after harmonization. Notably, “Nonfluencies” is the most important
feature for conversation classification after harmonization, which is more consistent with the
pathology of dementia: dementia (even at the preclinical stage) may make it harder for a subject
to find the right words, leading to a higher number of nonfluencies during communication. 2)
Table 5. Top 10 language features before and after harmonization, where the importance is
measured by the decrease in AUC in both conversation classification and subject classification.

Before harmonization                           After harmonization
Feature name               Type   AUC drop     Feature name              Type   AUC drop

Conversation classification
Negations                  LIWC   0.02749      Nonfluencies              LIWC   0.00616
1st pers plural            LIWC   0.00587      Assent                    LIWC   0.00471
Discrepancy                LIWC   0.00495      Insight                   LIWC   0.00468
Assent                     LIWC   0.00328      Affective processes       LIWC   0.00455
Family                     LIWC   0.00325      T-unit per sentence       SC     0.00455
Tentative                  LIWC   0.00324      3rd pers singular         LIWC   0.00439
Sexual                     LIWC   0.00297      Causation                 LIWC   0.00435
Auxiliary verbs            LIWC   0.00238      Certainty                 LIWC   0.00418
Home                       LIWC   0.00215      Mean length of sentence   SC     0.00414
Inhibition                 LIWC   0.00204      Hear                      LIWC   0.00395

Subject classification
Negations                  LIWC   0.03469      T-unit per sentence       SC     0.01203
Tentative                  LIWC   0.00562      Mean length of sentence   SC     0.00969
Family                     LIWC   0.00547      Negations                 LIWC   0.00922
Textual lexical diversity  LD     0.00531      Clause                    SC     0.00906
Home                       LIWC   0.00438      Affective processes       LIWC   0.00859
Social processes           LIWC   0.00391      Causation                 LIWC   0.00844
1st pers plural            LIWC   0.00359      Cognitive processes       LIWC   0.00828
Assent                     LIWC   0.00344      Positive emotion          LIWC   0.00813
Personal pronouns          LIWC   0.00313      Inclusive                 LIWC   0.00813
Discrepancy                LIWC   0.00313      Motion                    LIWC   0.00750

Abbreviations: LIWC, Linguistic Inquiry and Word Count; SC, Syntactic Complexity; LD, Lexical
Diversity.


more syntactic complexity features appear after harmonization for subject classification. The
top features “T-unit per sentence” and “mean length of sentence” relate directly to the
ability to construct longer sentences.


5. Discussion 


In this paper, we propose a subject harmonization algorithm to mitigate the distributional 
difference of digital biomarkers induced by subject variability. Our empirical results show that 
applying subject harmonization to language markers improves the performance of MCI detec- 
tion. We show the effects of subject variability from a quantitative perspective using a subject 
prediction task, and also from a qualitative perspective from visible clusters in the visual- 
ization of language markers. Our experiments show that the proposed subject harmonization 
approach effectively mitigates the subject variability so that the harmonized data has much 
less power to differentiate among subjects. Meanwhile, we show that MCI detection models 
built from language markers harmonized by the proposed subject harmonization improve the
predictive performance. With the MLP classifier, the harmonization improves the AUC score of MCI
prediction from 0.594 to 0.646 in the conversation classification task and from 0.626 to 0.657 in
the subject classification task. We further investigated the sub-group performance by age, gender,
and years of education, and we see that the performance of most groups has improved.

Despite the improvement in prediction performance using language markers through the
harmonization algorithm, several issues remain for future study. Firstly, the prediction
performance from language markers is yet to be improved. A possible reason is the quality of
the language markers and that we only used linguistic and syntactic information. We will
study subject harmonization on additional feature variables, such as speech and video.
Secondly, applying deep harmonization to demographic variables reduced predictive
performance, indicating that the proposed deep harmonization network is currently
not applicable to general harmonization usage. We plan to investigate the theoretical relationship
between the two harmonization types and to improve the deep harmonization network to handle
demographic variables. Thirdly, while we have successfully validated the positive impact of
harmonization on language markers, it remains to confirm its efficacy on other data types. We 
plan to dedicate considerable time to applying the harmonization algorithm to different types 
of markers, such as clinical data or brain imaging data. This broader exploration will enable 
us to assess the generalizability and versatility of the harmonization technique across various 
data modalities, facilitating a more comprehensive understanding of its potential applications. 

Scalar-Function Causal Discovery for Generating Causal Hypotheses with
Observational Wearable Device Data

Digital health technologies such as wearable devices have transformed health data analytics,
providing continuous, high-resolution functional data on various health metrics, thereby
opening new avenues for innovative research. In this work, we introduce a new approach for 
generating causal hypotheses for a pair of a continuous functional variable (e.g., physical ac- 
tivities recorded over time) and a binary scalar variable (e.g., mobility condition indicator). 
Our method goes beyond traditional association-focused approaches and has the poten- 
tial to reveal the underlying causal mechanism. We theoretically show that the proposed 
scalar-function causal model is identifiable with observational data alone. Our identifiability 
theory justifies the use of a simple yet principled algorithm to discern the causal relationship 
by comparing the likelihood functions of competing causal hypotheses. The robustness and 
applicability of our method are demonstrated through simulation studies and a real-world 
application using wearable device data from the National Health and Nutrition Examination 
Survey. 


Keywords: Causal identifiability, digital health, NHANES, observational data, wearable de- 
vice. 


1. Introduction 


The rise of wearable devices has revolutionized the way we collect and analyze health data, 
offering an unprecedented wealth of information about human health and behavior. These de- 
vices such as accelerometers and continuous glucose monitors allow for frequent measurement 
of various variables over time including physical activities, sleep patterns, electrocardiogram 
signals, and blood glucose levels. The availability of these measurements enables researchers to 
ask questions that previously could not be answered, e.g., how to quantify the effect of physical 
activities on all-cause mortality? In these types of scenarios, often, one variable (e.g., physical 
activities) is longitudinal/functional and the other (e.g., mortality) is a scalar. Thus, many 
statistical methods such as scalar-on-function regression models!®!" have been successfully 
deployed to estimate the association of the scalar-function pair. 

The focus of this paper is, however, different from the existing literature for modeling 
wearable device data. Instead of association, we investigate whether it is possible to discern the 
causal relationship between a scalar and a function. More specifically, we aim to identify which 
of the scalar-function pair is more likely to be the cause or effect given observational data alone. 


© 2023 The Authors. Open Access chapter published by World Scientific Publishing Company and 
distributed under the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC) 
4.0 License. 
We introduce a novel scalar-function causal discovery method to generate data-driven causal
hypotheses. Revealing the causality underlying observed data can deepen our understanding 
of the physical mechanism involved in the data-generating process and potentially pave the 
way for better health interventions and policy-making. 

The field of causal discovery has seen a significant surge in interest and development 
over recent years due to wide-ranging applicability across various domains.*°:!?-1416,22 While 
traditional causal discovery methods are typically tailored to handle either continuous or 
discrete variables exclusively, real-world scenarios are often far more complex. For example, 
in the fields of social and health sciences, data frequently comprise a mix of different types of 
variables, necessitating more versatile approaches. 

In such scenarios, one may either discard discrete data or convert continuous data into a 
discrete form;™? either way, a lot of information contained in the original data is lost. In light 
of these limitations, there have been some recent developments to discover causality for mixed 
data.!*:?! However, these methods have only been developed for scalar variables, which cannot 
be used for functional data. To deal with functional data, some very recent works®?? have been 
proposed, which, however, cannot accommodate scalar and/or discrete data. In summary, to the
best of our knowledge, there are no existing methods that can identify causality between a 
continuous functional variable and a binary scalar variable. 

This paper, therefore, aims to fill this critical gap in the causal discovery literature so that 
digital health researchers will have a powerful tool to identify causality in a wide range of 
observational wearable device data. Our approach is based on a probabilistic causal model 
that quantifies the likelihood of each possible causal direction (from function to scalar or from 
scalar to function). We theoretically establish the causal identifiability property of our model 
under common causal assumptions. Equipped with the identifiability property, we can simply 
identify causal directions based on likelihood functions. 

We conduct simulation studies to assess the empirical identifiability of the proposed 
method. In addition, to validate our method in real-world scenarios, we present an applica- 
tion with two variables that have a clear causal relationship. Specifically, we consider mobility 
conditions and physical activities. Since it is clear that mobility issues may lead to reduced 
activities, we will test whether our method can correctly identify such a causal relationship
without prior knowledge using the National Health and Nutrition Examination Survey (NHANES)
data. 

The rest of the paper is organized in the following way. In Section 2, we describe the pro- 
posed scalar-function causal discovery model, theoretically prove that the causal relationship 
is identifiable, and develop a likelihood-based estimation procedure. In Section 3, we evaluate 
the proposed method through various simulations as well as a real wearable device dataset 
from NHANES, demonstrating its capability to correctly identify the true causal relationship. 
We conclude our paper with a brief discussion in Section 4. 
2. Method
2.1. Notations 


We use capital letters to denote random variables and small letters to denote their realized 
values. We use boldface to denote vectors or matrices and non-boldface to denote scalars. 
With a slight abuse of notation, we use P(-) to denote both probability mass and density 
functions, which can be understood from the context as it is determined by the type of the 
random variable under consideration. Let $\mathcal{M}^{n \times n}$ be the cone of $n \times n$ positive definite matrices.


2.2. Causal Probability Model 


We are interested in identifying the causal relationship between two statistically dependent 
random variables: a random binary variable $Y \in \{0, 1\}$ and a random function measured at $n$
time points, $\mathbf{X} = (X(t_1), \ldots, X(t_n))^\top \in \mathbb{R}^n$. One can view these functional measurements as a
finite realization of an infinite stochastic process X(-) such as the Gaussian process.!® 

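We consider two competing causal hypotheses$^a$,

$$H_0: \mathbf{X} \to Y \ (\mathbf{X} \text{ causes } Y) \quad \text{vs.} \quad H_1: Y \to \mathbf{X} \ (Y \text{ causes } \mathbf{X}).$$

Under each hypothesis, we will set up a probability model. Specifically, let
$P_{\mathbf{X} \to Y}(\mathbf{X} = \mathbf{x}, Y = y)$ denote the probability model of $H_0$ and
$P_{Y \to \mathbf{X}}(\mathbf{X} = \mathbf{x}, Y = y)$ denote the probability model of $H_1$.
Using the probability chain rule, we have

$$P_{\mathbf{X} \to Y}(\mathbf{X} = \mathbf{x}, Y = y) = P_{\mathbf{X} \to Y}(Y = y \mid \mathbf{X} = \mathbf{x}) \cdot P_{\mathbf{X} \to Y}(\mathbf{X} = \mathbf{x}), \quad (1)$$
$$P_{Y \to \mathbf{X}}(\mathbf{X} = \mathbf{x}, Y = y) = P_{Y \to \mathbf{X}}(\mathbf{X} = \mathbf{x} \mid Y = y) \cdot P_{Y \to \mathbf{X}}(Y = y),$$

where $P_{\mathbf{X} \to Y}(Y = y \mid \mathbf{X} = \mathbf{x})$ and $P_{\mathbf{X} \to Y}(\mathbf{X} = \mathbf{x})$
are respectively the conditional and marginal probability distributions under $H_0: \mathbf{X} \to Y$, and similarly
$P_{Y \to \mathbf{X}}(\mathbf{X} = \mathbf{x} \mid Y = y)$ and $P_{Y \to \mathbf{X}}(Y = y)$ are those under
$H_1: Y \to \mathbf{X}$. Next, we will discuss the choice of these four
probability distributions.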
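For the marginal distribution of $Y$, we assume it to be a Bernoulli distribution with success
probability $p \in (0, 1)$,

$$P_{Y \to \mathbf{X}}(Y = y) = p^{y} (1 - p)^{1 - y}. \quad (2)$$

For the marginal distribution of $\mathbf{X}$, we assume it to be a multivariate Gaussian distribution
with mean $\boldsymbol{\mu} \in \mathbb{R}^n$ and covariance matrix $\boldsymbol{\Sigma} \in \mathcal{M}^{n \times n}$,

$$P_{\mathbf{X} \to Y}(\mathbf{X} = \mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}), \quad (3)$$

where $\mathcal{N}(\mathbf{x} \mid \cdot, \cdot)$ is the Gaussian probability density function evaluated at $\mathbf{x}$.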
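To model the conditional distribution of $Y$ given $\mathbf{X}$, we adopt a linear logistic regression,

$$\log \frac{\phi_{\mathbf{x}}}{1 - \phi_{\mathbf{x}}} = \alpha_0 + \boldsymbol{\alpha}^\top \mathbf{x},$$

where $\alpha_0$ is the intercept and $\boldsymbol{\alpha} \neq \mathbf{0} \in \mathbb{R}^n$ are the slopes. That is, $Y$ is conditionally Bernoulli,

$$P_{\mathbf{X} \to Y}(Y = y \mid \mathbf{X} = \mathbf{x}) = \phi_{\mathbf{x}}^{y} (1 - \phi_{\mathbf{x}})^{1 - y}, \quad (4)$$

with the success probability depending on $\mathbf{X}$ through a sigmoid transformation,

$$\phi_{\mathbf{x}} = \frac{1}{1 + e^{-\alpha_0 - \boldsymbol{\alpha}^\top \mathbf{x}}}.$$

$^a$ Note that we are not performing null hypothesis testing. Our method is exploratory.
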
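To specify $P_{Y \to \mathbf{X}}(\mathbf{X} \mid Y)$, we employ a multivariate linear regression model,

$$\mathbf{X} = \boldsymbol{\beta}_0 + Y \boldsymbol{\beta}_1 + \boldsymbol{\epsilon},$$

where $\boldsymbol{\beta}_0 \in \mathbb{R}^n$ are the intercepts, $\boldsymbol{\beta}_1 \neq \mathbf{0} \in \mathbb{R}^n$ are the slopes,
and $\boldsymbol{\epsilon} \in \mathbb{R}^n$ are Gaussian errors with mean zero and covariance $\boldsymbol{\Omega}$.
The multivariate linear regression model above implies

$$P_{Y \to \mathbf{X}}(\mathbf{X} = \mathbf{x} \mid Y = y) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\theta}_y, \boldsymbol{\Omega}), \quad (5)$$

with $\boldsymbol{\theta}_y = \boldsymbol{\beta}_0 + y \boldsymbol{\beta}_1$.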
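Putting (1)-(5) together, we have

$$P_{\mathbf{X} \to Y}(\mathbf{X} = \mathbf{x}, Y = y) = \phi_{\mathbf{x}}^{y} (1 - \phi_{\mathbf{x}})^{1 - y} \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}),$$
$$P_{Y \to \mathbf{X}}(\mathbf{X} = \mathbf{x}, Y = y) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\theta}_y, \boldsymbol{\Omega}) \, p^{y} (1 - p)^{1 - y}. \quad (6)$$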


2.3. Causal Identifiability 


Since we only have access to observational data, the two competing causal hypotheses may
not be identifiable, i.e., it may be that $P_{\mathbf{X} \to Y}(\mathbf{X} = \mathbf{x}, Y = y) = P_{Y \to \mathbf{X}}(\mathbf{X} = \mathbf{x}, Y = y)$ for all $\mathbf{x}$ and $y$. For
example, if both $\mathbf{X}$ and $Y$ are Gaussian, they are not identifiable. Consequently, even with an
infinite amount of data, one cannot tell these two causal models apart, which is clearly an undesirable
feature. Fortunately, we will show, both theoretically and empirically, that the proposed model 
is identifiable. 


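Definition 1 (Causal Identifiability). We say $H_0$ and $H_1$ are identifiable if one cannot find
any values of $\{\alpha_0, \boldsymbol{\alpha}, \boldsymbol{\mu}, \boldsymbol{\Sigma}\}^b$ and $\{\boldsymbol{\beta}_0, \boldsymbol{\beta}_1, \boldsymbol{\Omega}, p\}^c$ such that
$P_{\mathbf{X} \to Y}(\mathbf{X} = \mathbf{x}, Y = y) = P_{Y \to \mathbf{X}}(\mathbf{X} = \mathbf{x}, Y = y)$ for all $\mathbf{x}$ and $y$.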


Under the causal sufficiency assumption (i.e., there is no unmeasured confounder) com- 
monly adopted in the literature,?4:119,15,22,23 we have the following identifiability theorem. 


Theorem 1 (Causal Identifiability). Assuming causal sufficiency, the causal hypotheses
$H_0$ and $H_1$ are identifiable under model (6).


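$^b$ The parameters of $P_{\mathbf{X} \to Y}(\mathbf{X} = \mathbf{x}, Y = y)$.
$^c$ The parameters of $P_{Y \to \mathbf{X}}(\mathbf{X} = \mathbf{x}, Y = y)$.

Proof. We will show by contradiction. Suppose

$$P_{\mathbf{X} \to Y}(\mathbf{X} = \mathbf{x}, Y = y \mid \alpha_0, \boldsymbol{\alpha}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = P_{Y \to \mathbf{X}}(\mathbf{X} = \mathbf{x}, Y = y \mid \boldsymbol{\beta}_0, \boldsymbol{\beta}_1, \boldsymbol{\Omega}, p) \quad (7)$$

for all $\mathbf{x} \in \mathbb{R}^n$ and $y \in \{0, 1\}$. Summing both sides of (7) over $y$ from 0 to 1, we have

$$\sum_{y=0}^{1} P_{\mathbf{X} \to Y}(Y = y \mid \mathbf{X} = \mathbf{x}, \alpha_0, \boldsymbol{\alpha}) \, P_{\mathbf{X} \to Y}(\mathbf{X} = \mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{y=0}^{1} P_{Y \to \mathbf{X}}(\mathbf{X} = \mathbf{x} \mid Y = y, \boldsymbol{\beta}_0, \boldsymbol{\beta}_1, \boldsymbol{\Omega}) \, P_{Y \to \mathbf{X}}(Y = y \mid p). \quad (8)$$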

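The left-hand side of (8) is given by

$$\begin{aligned}
\sum_{y=0}^{1} P_{\mathbf{X} \to Y}(Y = y \mid \mathbf{X} = \mathbf{x}, \alpha_0, \boldsymbol{\alpha}) \, P_{\mathbf{X} \to Y}(\mathbf{X} = \mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
&= P_{\mathbf{X} \to Y}(\mathbf{X} = \mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \sum_{y=0}^{1} P_{\mathbf{X} \to Y}(Y = y \mid \mathbf{X} = \mathbf{x}, \alpha_0, \boldsymbol{\alpha}) \\
&= P_{\mathbf{X} \to Y}(\mathbf{X} = \mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \\
&= \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}), \quad (9)
\end{aligned}$$

where the second equality holds because the conditional probabilities of $Y$ sum to one.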
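The right-hand side of (8) is given by

$$\begin{aligned}
\sum_{y=0}^{1} P_{Y \to \mathbf{X}}(\mathbf{X} = \mathbf{x} \mid Y = y, \boldsymbol{\beta}_0, \boldsymbol{\beta}_1, \boldsymbol{\Omega}) \, P_{Y \to \mathbf{X}}(Y = y \mid p)
&= p \cdot P_{Y \to \mathbf{X}}(\mathbf{X} = \mathbf{x} \mid Y = 1, \boldsymbol{\beta}_0, \boldsymbol{\beta}_1, \boldsymbol{\Omega}) + (1 - p) \cdot P_{Y \to \mathbf{X}}(\mathbf{X} = \mathbf{x} \mid Y = 0, \boldsymbol{\beta}_0, \boldsymbol{\beta}_1, \boldsymbol{\Omega}) \\
&= p \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\beta}_0 + \boldsymbol{\beta}_1, \boldsymbol{\Omega}) + (1 - p) \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\beta}_0, \boldsymbol{\Omega}). \quad (10)
\end{aligned}$$

Note that (9) is a Gaussian distribution whereas (10) is a mixture of two Gaussian distributions.
Therefore, for them to be equivalent, we must have $p = 0$, $p = 1$, or $\boldsymbol{\beta}_1 = \mathbf{0}$, which are
degenerate cases (i.e., either $Y$ is deterministically 0 or 1, or $\mathbf{X}$ and $Y$ are independent)
excluded by the model assumptions $p \in (0, 1)$ and $\boldsymbol{\beta}_1 \neq \mathbf{0}$. This contradiction completes the proof.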
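
The final step of the proof, that a two-component Gaussian mixture with a common covariance can match a single Gaussian only in the degenerate cases, can be checked with moment generating functions. The derivation below is offered as one possible way to fill in that step, not as the authors' own argument; the function $g$ and the scalar $s$ are introduced here only for illustration.

```latex
% MGFs of (9) and (10), for t in R^n:
\begin{aligned}
M_{(9)}(t)  &= \exp\!\left(t^\top \mu + \tfrac{1}{2}\, t^\top \Sigma\, t\right), \\
M_{(10)}(t) &= \exp\!\left(t^\top \beta_0 + \tfrac{1}{2}\, t^\top \Omega\, t\right)
               \left[\, p\, e^{t^\top \beta_1} + 1 - p \,\right].
\end{aligned}

% Equality for all t is equivalent to
\log\!\left[\, p\, e^{t^\top \beta_1} + 1 - p \,\right]
  = t^\top(\mu - \beta_0) + \tfrac{1}{2}\, t^\top(\Sigma - \Omega)\, t .

% If beta_1 != 0, take t = s\,\beta_1 / \|\beta_1\|^2 so that t^\top \beta_1 = s.
% Then g(s) = \log(p\, e^{s} + 1 - p) would have to be a polynomial of degree <= 2 in s.
% With \sigma(s) = p\, e^{s} / (p\, e^{s} + 1 - p):
g'(s) = \sigma(s), \qquad
g''(s) = \sigma(s)\bigl(1 - \sigma(s)\bigr), \qquad
g'''(s) = \sigma(s)\bigl(1 - \sigma(s)\bigr)\bigl(1 - 2\sigma(s)\bigr).

% For p in (0,1), g'''(s) is nonzero whenever \sigma(s) \neq 1/2, so g is not quadratic.
% Hence equality forces p = 0, p = 1, or \beta_1 = 0.
```

The same conclusion also follows from the tails of $g$: it approaches the constant $\log(1 - p)$ as $s \to -\infty$ and grows like $s + \log p$ as $s \to \infty$, which no quadratic polynomial can do.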


Although our theorem relies on the causal sufficiency assumption, the experiments in Sec- 
tion 3.1.3 empirically show that the proposed method is relatively robust to the presence of 
unmeasured confounders. 


2.4. Estimation 


Theorem 1 establishes a property of the probability model and therefore is a population-level
result. It implies that for a large enough sample size, one can identify the correct
causal hypothesis even with observational data alone. For a finite sample, our identifiability
result paves the way for a simple, yet useful, causal discovery algorithm based on maximum
likelihood estimation (MLE). We aim to determine whether $\mathbf{X}$ causes $Y$ or vice versa by
comparing the respective maximized likelihoods. Therefore, when provided with a dataset of $N$ subjects,
$(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$, we conclude $H_0: \mathbf{X} \to Y$ if

$$\max_{\alpha_0, \boldsymbol{\alpha}, \boldsymbol{\mu}, \boldsymbol{\Sigma}} \prod_{i=1}^{N} P_{\mathbf{X} \to Y}(\mathbf{X} = \mathbf{x}_i, Y = y_i \mid \alpha_0, \boldsymbol{\alpha}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) > \max_{\boldsymbol{\beta}_0, \boldsymbol{\beta}_1, \boldsymbol{\Omega}, p} \prod_{i=1}^{N} P_{Y \to \mathbf{X}}(\mathbf{X} = \mathbf{x}_i, Y = y_i \mid \boldsymbol{\beta}_0, \boldsymbol{\beta}_1, \boldsymbol{\Omega}, p),$$

and $H_1: Y \to \mathbf{X}$ otherwise. Note that the two competing hypotheses have the same model
complexity (i.e., the same number of parameters) and hence a model complexity penalty, 
which is typically needed for model selection, is not necessary here. The factorized form of 
the proposed model (6) allows us to separately find the MLE of each of its four components 
using existing standard techniques. 
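
The likelihood comparison can be illustrated in a simplified setting where X is a single scalar (n = 1), so that every component of model (6) has a simple estimator. This is a sketch under that assumption, not the authors' implementation: all function names are hypothetical, the logistic regression is fitted with a plain Newton iteration, and both classes of Y are assumed to be present in the data.

```python
import math
import random


def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))


def gaussian_loglik(values, means, var):
    """Sum of log N(v | m, var) over paired values and means."""
    return sum(-0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
               for v, m in zip(values, means))


def loglik_x_causes_y(x, y, iters=50):
    """Maximized log-likelihood of H0: X ~ Gaussian, Y | X ~ logistic regression."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    a0 = a1 = 0.0
    for _ in range(iters):  # Newton's method on the logistic log-likelihood
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            phi = math.exp(log_sigmoid(a0 + a1 * xi))
            w = phi * (1.0 - phi)
            g0 += yi - phi
            g1 += (yi - phi) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det < 1e-12:
            break
        a0 += (h11 * g0 - h01 * g1) / det
        a1 += (h00 * g1 - h01 * g0) / det
    cond = sum(log_sigmoid(a0 + a1 * xi) if yi == 1 else log_sigmoid(-(a0 + a1 * xi))
               for xi, yi in zip(x, y))
    return gaussian_loglik(x, [mu] * n, var) + cond


def loglik_y_causes_x(x, y):
    """Maximized log-likelihood of H1: Y ~ Bernoulli, X | Y ~ Gaussian regression."""
    n = len(x)
    p = sum(y) / n
    mean = {c: sum(v for v, yi in zip(x, y) if yi == c) / y.count(c) for c in (0, 1)}
    fitted = [mean[yi] for yi in y]
    var = sum((v - m) ** 2 for v, m in zip(x, fitted)) / n
    bern = sum(math.log(p) if yi == 1 else math.log(1.0 - p) for yi in y)
    return bern + gaussian_loglik(x, fitted, var)


def choose_direction(x, y):
    """Pick the hypothesis with the larger maximized likelihood."""
    return "X -> Y" if loglik_x_causes_y(x, y) > loglik_y_causes_x(x, y) else "Y -> X"


def simulate_y_to_x(n, shift=4.0, seed=0):
    """Toy data generated under H1: Y ~ Bernoulli(0.5), X = shift * Y + noise."""
    rng = random.Random(seed)
    y = [int(rng.random() < 0.5) for _ in range(n)]
    x = [shift * yi + rng.gauss(0.0, 1.0) for yi in y]
    return x, y
```

In the multivariate setting of the paper, the same comparison would use a multivariate Gaussian density and a multivariate logistic regression, typically on the reduced representation of X described next.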

However, we note that in our motivating application, X is high-dimensional (n = 1,440). 
For better statistical and computational efficiency, we choose to reduce its dimensionality 
before finding the MLE. Specifically, the functional principal component analysis (FPCA) is 
used, which can reduce the functional data into a few uncorrelated functional principal compo- 
nents (FPCs) that explain the most variation among all the functional bases. We decompose 
the covariance function of a stochastic process $X(\cdot)$ as

$$\mathrm{Cov}(X(s), X(t)) = \sum_{k=1}^{\infty} \lambda_k \psi_k(s) \psi_k(t),$$

where the $\lambda_k$'s are the nonnegative eigenvalues in descending order and the $\psi_k(\cdot)$'s are the
corresponding orthogonal eigenfunctions. By the Karhunen-Loève theorem,

$$X(t) = \mu(t) + \sum_{k=1}^{\infty} Z_k \psi_k(t),$$

where $\mu(t) = E[X(t)]$, $\{\psi_k(t)\}_{k=1}^{\infty}$ are referred to as the FPCs, and $\{Z_k\}_{k=1}^{\infty}$ denotes the
corresponding FPC scores. In practice, we would choose the first $K \ll n$ FPC scores $Z =
(Z_1, \ldots, Z_K)^\top$ that explain 99% of the variance and replace $X$ by $Z$ in the proposed model when
finding the MLEs.
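
The truncation step can be sketched as follows, assuming all curves are observed on a common dense grid so that FPCA reduces to an eigendecomposition of the sample covariance matrix. This is a discretized stand-in and not the routine used in the paper; the function name `fpca_scores` and the toy curves are hypothetical, and the grid-spacing factor that would normalize the eigenfunctions is omitted.

```python
import numpy as np

def fpca_scores(X, var_explained=0.99):
    """Discretized FPCA. X is an (N, n) array with one curve per row."""
    mu = X.mean(axis=0)                          # estimate of the mean function
    Xc = X - mu                                  # centered curves
    cov = Xc.T @ Xc / (X.shape[0] - 1)           # (n, n) sample covariance
    evals, evecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]   # reorder to descending
    evals = np.clip(evals, 0.0, None)            # guard against tiny negatives
    frac = np.cumsum(evals) / evals.sum()        # cumulative variance explained
    K = int(np.searchsorted(frac, var_explained) + 1)  # smallest K reaching the threshold
    Z = Xc @ evecs[:, :K]                        # (N, K) FPC scores
    return Z, evals[:K], evecs[:, :K], mu

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 50)
# Toy curves (illustrative only): two smooth components plus small noise.
X = (rng.normal(size=(200, 1)) * np.sin(2 * np.pi * t)
     + 0.5 * rng.normal(size=(200, 1)) * np.cos(2 * np.pi * t)
     + 0.01 * rng.normal(size=(200, 50)))
Z, lam, psi, mu = fpca_scores(X)
```

The score matrix `Z` would then take the place of the raw curves when fitting either hypothesis.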

Finally, to assess the uncertainty of our approach, we use the bootstrap? technique in 
our real data application. We first generate B bootstrap samples by resampling subjects 
with replacement. Each bootstrap sample has the same size as the original dataset. Then we 
apply our method to each bootstrap sample and record our choice between $H_0$ and $H_1$. The
proportion of times that we choose $H_0$ or $H_1$ reflects our confidence toward each hypothesis.
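
A minimal sketch of this bootstrap loop is given below. The two log-likelihood callables are hypothetical placeholders for whatever routine returns the maximized log-likelihood under each hypothesis; the function name `bootstrap_direction` is likewise an assumption made for illustration.

```python
import numpy as np

def bootstrap_direction(x, y, loglik_x_to_y, loglik_y_to_x, B=100, seed=0):
    """Fraction of bootstrap samples in which X -> Y has the larger maximized log-likelihood."""
    rng = np.random.default_rng(seed)
    N = len(y)
    votes = 0
    for _ in range(B):
        idx = rng.integers(0, N, size=N)          # resample subjects with replacement
        diff = loglik_x_to_y(x[idx], y[idx]) - loglik_y_to_x(x[idx], y[idx])
        votes += int(diff > 0)                    # one vote for H0 when X -> Y wins
    return votes / B

# Stand-in likelihoods that ignore the data, used only to exercise the loop.
x_demo = np.zeros((20, 5))
y_demo = np.zeros(20, dtype=int)
frac_h0 = bootstrap_direction(x_demo, y_demo,
                              lambda a, b: -10.0, lambda a, b: -12.0, B=50)
```

The returned fraction is the bootstrap support for $H_0$; one minus that value is the support for $H_1$.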


3. Experiments 


We first tested our model through various simulation scenarios on synthetic data where there is 
known ground truth. After confirming its effectiveness, we then applied our method to a real- 
world mobility-activity dataset, demonstrating its practical capability in generating plausible 
causal hypotheses. 


3.1. Simulations 


To assess the efficacy of the proposed model, we performed simulations on three different 
synthetic datasets including one with unmeasured confounders. Each simulation was repeated 
500 times, measuring the accuracy by the frequency at which we correctly identified the true
hypothesis. By considering varying sample sizes $N$ between 50 and 200, we investigated the
asymptotic behavior of our method. Furthermore, we examined the performance of the model
under various signal strengths. For ease of exposition, $\delta_i$ always denotes standard Gaussian
white noise hereafter, i.e., $\delta_i \sim N(0, I)$ where $I$ is an identity matrix.


3.1.1. Case 1: True Direction X => Y 


For each subject $i = 1, \ldots, N$, the functional data $x_i$ were created by first sampling their mean
$m$ from a centered Gaussian process at $n = 30$ evenly spaced time points,

$$m \sim GP(0, K)$$

with the powered exponential covariance function,

$$K(t, s) = \exp\{-|t - s|^{\kappa}\},$$

of which the power $\kappa = 1.9$, and then setting

$$x_i = m + \delta_i.$$

We performed the FPCA?4 on $x_1, \ldots, x_N$ using the R package fdapace, and retained the first $K$
FPCs that explained 99% of the variance. We denote the standardized FPC scores by $z_1, \ldots, z_N$.
To create the causal dependency of $y_i$ on $x_i$ through $z_i$, we generated $y_i$ from a probit
regression,

$$y_i = \begin{cases} 1 & \text{if } y_i^* > 0, \\ 0 & \text{otherwise,} \end{cases}$$

where

$$y_i^* = 0.5 + z_i^\top \gamma + \epsilon_i$$

with $\epsilon_i \sim N(0, 1)$. Here $\gamma = (\gamma_1, \ldots, \gamma_K)^\top$ is the direct causal effect (signal), which will be
varied at three levels: weak ($\gamma_k = \pm 1$), moderate ($\gamma_k = \pm 1.5$), and strong ($\gamma_k = \pm 3$).
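
The data-generating steps of Case 1 can be sketched in Python as below. Several details are assumptions made for illustration and are not stated in the text: the time domain is taken to be [0, 1], a separate mean curve is drawn for each subject, all signal coefficients share one positive value, and FPCA is replaced by a grid-based eigendecomposition rather than the fdapace routine. The names `sample_gp` and `simulate_case1` are hypothetical.

```python
import numpy as np

def sample_gp(rng, cov, size):
    """Draw `size` centered Gaussian vectors with covariance `cov` via its eigendecomposition."""
    evals, evecs = np.linalg.eigh(cov)
    root = evecs * np.sqrt(np.clip(evals, 0.0, None))
    return rng.normal(size=(size, cov.shape[0])) @ root.T

def simulate_case1(N=100, n=30, kappa=1.9, gamma_k=1.5, seed=0):
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, n)                        # assumed time domain [0, 1]
    cov = np.exp(-np.abs(t[:, None] - t[None, :]) ** kappa)  # powered exponential kernel
    m = sample_gp(rng, cov, N)                          # assumption: one mean draw per subject
    x = m + rng.normal(size=(N, n))                     # add standard Gaussian white noise
    xc = x - x.mean(axis=0)
    evals, evecs = np.linalg.eigh(xc.T @ xc / (N - 1))  # grid-based FPCA stand-in
    evals, evecs = evals[::-1], evecs[:, ::-1]
    K = int(np.searchsorted(np.cumsum(evals) / evals.sum(), 0.99) + 1)
    z = xc @ evecs[:, :K]
    z = z / z.std(axis=0, ddof=1)                       # standardized FPC scores
    gamma = np.full(K, gamma_k)                         # assumption: equal, positive effects
    y = (0.5 + z @ gamma + rng.normal(size=N) > 0).astype(int)  # probit outcome
    return x, z, y

x_sim, z_sim, y_sim = simulate_case1(N=100)
```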

The simulation results are reported in Table 1, showing an expected trend: the stronger the 
signal is, the more accurately the true causal direction can be discerned. Also, the accuracy ap- 
proaches 100% as the sample size increases for the moderate and strong signal cases. Even with 
the weak signal, the accuracy was still good, around 90%. Note that for a non-identifiable
model, the expected accuracy is 50%.


3.1.2. Case 2: True Direction Y => X 


Exploring the reverse causal direction, we first generated the binary cause variable $y_i$ from
a Bernoulli distribution with a success probability of 0.5. Then we generated the functional
effect variable,

$$x_i = m_{y_i} + \delta_i,$$

where $m_y \sim GP(0, K_y)$ for $y = 0, 1$ with the powered exponential covariance function $K_y$ of
which the power $\kappa$ depends on $y$. Specifically, $\kappa = 1.9$ if $y = 1$, and $\kappa = 0.3$ (strong signal), 1.1
(moderate signal), or 1.7 (weak signal) if $y = 0$.
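
A short sketch of the reverse-direction generator follows. It assumes the effect is formed as the class-specific mean curve plus standard Gaussian white noise, by analogy with the other simulation cases, and that the time domain is [0, 1]; the function name `simulate_case2` is hypothetical.

```python
import numpy as np

def simulate_case2(N=100, n=30, kappa0=1.1, kappa1=1.9, seed=0):
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, n)                     # assumed time domain [0, 1]
    dist = np.abs(t[:, None] - t[None, :])
    roots = {}
    for label, kappa in ((0, kappa0), (1, kappa1)):
        evals, evecs = np.linalg.eigh(np.exp(-dist ** kappa))   # class-specific kernel
        roots[label] = evecs * np.sqrt(np.clip(evals, 0.0, None))
    y = rng.binomial(1, 0.5, size=N)                 # binary cause
    x = np.empty((N, n))
    for i in range(N):
        m = roots[int(y[i])] @ rng.normal(size=n)    # mean curve drawn given y_i
        x[i] = m + rng.normal(size=n)                # add white noise
    return x, y

x2, y2 = simulate_case2(N=80, kappa0=0.3)            # kappa0 = 0.3 is the strong-signal setting
```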




Table 1: Simulations. Accuracy of the proposed model in determining true causal directions
in synthetic datasets over 500 simulations.

                                          Sample size
Case     Confounder    Signal      50       100      150      200
X → Y    None          weak        92.8%    88.8%    90%      87.8%
X → Y    None          moderate    92.8%    97.2%    97.6%    99.6%
X → Y    None          strong      93.8%    98.6%    99%      99.8%
X → Y    Functional    -           94.6%    98.2%    99.2%    99.2%
X → Y    Binary        -           92.4%    99.2%    100%     100%
Y → X    None          weak        82.2%    89%      89.4%    93.6%
Y → X    None          moderate    83.4%    90.2%    96%      97.2%
Y → X    None          strong      97.8%    99%      99.4%    99.8%
Y → X    Functional    -           39.8%    61.6%    75%      83.8%
Y → X    Binary        -           63.6%    77%      85.2%    85.2%


As anticipated, our simulation results (Table 1) show that as the signal or the sample size 
increases, the accuracy approaches 100%. 


3.1.3. Case 3: Hidden Confounders


Simulation Cases 1 and 2 above demonstrate the validity of Theorem 1, i.e., causal directions
can be identified even with observational data alone. We now empirically assess the robustness 
of the proposed method with respect to the violation of the causal sufficiency assumption, i.e., 
we test whether our method can still identify the correct causal direction in the presence of 
unmeasured confounders. 

Our methodology hinges on determining the causality between two distinct types of
variables, binary scalar and continuous functional. Accordingly, we considered unobserved
confounders, generically denoted by $C$, that are also either binary scalar or continuous
functional, and investigated the four scenarios depicted in Fig. 1. We generated
data from these four causal graphs and hid $C$ from our method (i.e., only took $X$ and
Y as the inputs of our algorithm). As before, we recorded the frequency at which we correctly 
identified the causal direction between X and Y. 

In Fig. 1 (a) and (b), where the confounder is binary, we generated the confounder $c_i$ from
a Bernoulli distribution with success probability 0.5. In Fig. 1 (a), the mean $m_{c_i}$ of $x_i$ was
generated from a conditional Gaussian process $m_c \sim GP(0, K_c)$ with the powered exponential
covariance function $K_c$ of which the power $\kappa$ depends on $c$. Specifically, $\kappa = 1.9$ if $c = 1$ and
$\kappa = 1.5$ if $c = 0$. Then as before, we set $x_i = m_{c_i} + \delta_i$. Finally, we generated $y_i$ from a probit
regression model, $y_i = 1$ if $y_i^* > 0$ and $y_i = 0$ otherwise, where

$$y_i^* = 0.5 + 3 \cdot z_i^\top 1_K + 3 \cdot c_i + \epsilon_i$$

with $\epsilon_i \sim N(0, 1)$, the $z_i$'s being the FPC scores of $x_i$, and $1_K = (1, \ldots, 1)^\top$ a vector of ones with
length $K$.


Fig. 1: Four confounding scenarios under consideration in Simulation Case 3. [Panels (a)-(d); node legend: binary, functional.]

In Fig. 1 (b), we generated $y_i$ from a Bernoulli distribution with the success probability
$p_i$ dependent on $c_i$. More precisely, $p_i = 0.9$ if $c_i = 1$ and $p_i = 0.1$ if $c_i = 0$. Subsequently, the
mean $m_{c_i, y_i}$ of $x_i$ was generated from a conditional Gaussian process $m_{c,y} \sim GP(0, K_{c,y})$ with
the powered exponential covariance function $K_{c,y}$ of which the power $\kappa$ depends on both $c$ and
$y$. To be specific, $\kappa = 1.9$ if $c = 1$ and $y = 1$, $\kappa = 0.5$ if $c = 1$ and $y = 0$, $\kappa = 1.0$ if $c = 0$ and
$y = 1$, and $\kappa = 1.7$ if $c = 0$ and $y = 0$. Finally, we set $x_i = m_{c_i, y_i} + \delta_i$.

In Fig. 1 (c) and (d), the functional confounder $c_i$ was generated in the same way as $x_i$ in
Case 1 with $\kappa = 1.5$. Next, we performed the FPCA on $c_1, \ldots, c_N$ and retained the first $J$
FPCs that explained 99% of the variance. We denote the standardized FPC scores by $d_1, \ldots, d_N$. In
Fig. 1 (c), $m$ was first generated from a centered Gaussian process $m \sim GP(0, K)$ with $\kappa = 1.9$.
Then the dependence on the confounder was introduced by setting

$$x_i = 0.5 + 5 \cdot m + 5 \cdot c_i + \delta_i.$$

Finally, we generated $y_i$ from a probit regression model, $y_i = 1$ if $y_i^* > 0$ and $y_i = 0$ otherwise,
where

$$y_i^* = 0.5 + 5 \cdot z_i^\top 1_K + 5 \cdot d_i^\top 1_J + \epsilon_i$$

with $\epsilon_i \sim N(0, 1)$ and the $z_i$'s being the first $K$ FPC scores of $x_i$.


In Fig. 1 (d), we first generated $y_i$ from a probit regression model, $y_i = 1$ if $y_i^* > 0$ and
$y_i = 0$ otherwise, where

$$y_i^* = 0.5 + 3 \cdot d_i^\top 1_J + \epsilon_i$$

with $\epsilon_i \sim N(0, 1)$. Then we generated mean processes $m_{y_i}$ from a conditional Gaussian process
$m_y \sim GP(0, K_y)$. In this setting, the power $\kappa$ of the powered exponential covariance function
$K_y$ depended on $y$: $\kappa = 1.9$ if $y = 1$, and $\kappa = 0.5$ if $y = 0$. To introduce the influence of the
confounder, we defined

$$x_i = m_{y_i} + 3 \cdot c_i + \delta_i.$$


The results from these four confounding scenarios (Table 1) demonstrate the robustness 
of the proposed method. In particular, as the sample size increased, our method achieved
increasingly better accuracy and was substantially better than a random guess for large sample
sizes: at $N = 200$, accuracy was at least 83.8% in all four confounding scenarios.


3.2. Real Data 


Next, we applied the proposed methodology to data collected by the National Health and
Nutrition Examination Survey (NHANES). This extensive study, conducted by the Centers for
Disease Control and Prevention, gathered a wide range of health
and nutritional information about the U.S. population, including sociodemographic charac- 
teristics and various health conditions. To demonstrate the utility of the proposed method, 
we focus on two variables: physical activity $X$, captured via hip-attached
accelerometers, and an indicator variable of mobility issues $Y$, derived from self-reported
household interview data. Since it is reasonable to assume $Y \to X$ in this setting, we aim to verify
whether our method can correctly identify this causal direction, thereby validating
the method against a presumed ground truth.


3.2.1. Data Preprocessing 


Utilizing the NHANES dataset, we accessed activity data from hip-worn accelerometers dur- 
ing the 2003-2004 and 2005-2006 study waves. The magnitude of acceleration (movement 
“intensity” ) was captured using the ActiGraph AM-7164, delivering an objective measure of 
physical activity and bypassing the inconsistencies of self-reported data. Participants were 
instructed to wear the device for seven consecutive days, excluding swimming and bathing 
periods. The raw data were segmented into one-minute intervals or “epochs” with intensity 
readings accumulated per epoch and saved in long format (each row is a subject-minute). 

The well-formatted data are contained in the R package rnhanesdata."*® Following the 
preprocessing procedure in their paper, we included individuals aged 50 to 85 and omitted 
non-compliant individuals who have excessive missing accelerometer data, leaving us with 
N = 3,198 subjects. 

The activity data for each individual were aggregated over the 7-day period and
transformed via $\log(1 + x)$. This dataset is organized in a $7N \times 1440$ matrix, with one row designated
for each subject-day across all NHANES waves, where 7 denotes the days each subject wore
the accelerometer, and 1440 corresponds to the total number of minutes in a day.

The presence of any mobility issues was represented as a binary variable, categorized as 
either “No difficulty” or “Any difficulty,” based on responses from the Physical Functioning 
questionnaire. Individuals were classified under “Any difficulty” if they reported challenges 
in climbing 10 stairs, walking a quarter mile, abstained from these activities, or required 
special walking equipment. Overall, 32.4% of the subjects in the sample reported
some mobility problem.


3.2.2. Results 


We generated $B = 100$ bootstrap samples and successfully identified the correct causal direction
across all samples by comparing the maximized likelihoods of $Y \to X$ and $X \to Y$: the
mobility issue $Y$ unambiguously impacts an individual’s level of physical activity $X$ with high
confidence. As depicted in Fig. 2, the histogram illustrates the difference in the maximized
log-likelihoods between these two competing hypotheses (the former minus the latter), which
is noticeably bounded away from zero, meaning that $Y \to X$ is far more likely than $X \to Y$,
which matches the presumed truth.


Fig. 2: Real data. Histogram depicting the maximized log-likelihood differences between the
competing hypotheses $Y \to X$ and $X \to Y$.


4. Discussion 


In this paper, we have presented a new causal model for generating bivariate causal hypotheses 
with a continuous functional variable (e.g., physical activities) and a binary scalar variable 
(e.g., mobility issue indicator) in an exploratory fashion, which can provide insights as to 
which variable is more likely the cause. We theoretically proved that the underlying cause- 
effect relationship is identifiable with purely observational data under the causal sufficiency 
assumption. Empirically, we used a likelihood-based inference procedure and demonstrated the 
utility of the proposed method both under and beyond the causal sufficiency setting through 
simulation studies and a real-world wearable device application. 

There are several areas where this paper could be strengthened and extended. First, our 
NHANES application has focused on physical activities and mobility issue because of their 
clear causal relationship. Having demonstrated it is possible to identify their causal relation- 
ship, we plan to analyze other variables in the data to generate causal hypotheses in an 
exploratory manner, which is an intended use of the proposed method. 

Second, our identifiability theory operates under the assumption that there are no un- 
measured confounders. Even though our empirical investigations have indicated a degree of 
robustness to the presence of confounders, a theoretical exploration of identifiability within 
this context would be interesting and particularly relevant in observational studies where the 
presence of unmeasured confounders is common. 

Third, we have focused on the bivariate case; extending the method to multivariate
cases and leveraging additional publicly available datasets could considerably broaden the
method’s applicability. For example, the brain electroencephalogram dataset! comprises elec- 
troencephalogram signals collected over various trials with distinct stimuli for two groups - 
alcoholics and controls. By viewing the electroencephalogram signals as multivariate functional 
data, a recent paper?? attempts to discern the causal relationships among these functions. The 
multivariate extension of our method could potentially enrich this research by providing addi- 
tional insights into the causal relationships modified by the experimental groups by treating 
the group as a binary variable. Moreover, it should be relatively straightforward to extend our 
method to incorporate multiple categorical scalar variables. 

Finally, a Bayesian inference approach could be adopted especially for multivariate cases 
where efficient searching strategies in the causal graph space are required. A Bayesian approach 
would make it easier to make finite-sample inferences with natural uncertainty quantification 
for complex causal graphs.


Recent advancements in neuroimaging techniques have sparked a growing interest in
understanding the complex interactions between anatomical regions of interest (ROIs), forming
into brain networks that play a crucial role in various clinical tasks, such as neural pat- 
tern discovery and disorder diagnosis. In recent years, graph neural networks (GNNs) have 
emerged as powerful tools for analyzing network data. However, due to the complexity of 
data acquisition and regulatory restrictions, brain network studies remain limited in scale 
and are often confined to local institutions. These limitations greatly challenge GNN mod- 
els to capture useful neural circuitry patterns and deliver robust downstream performance. 
As a distributed machine learning paradigm, federated learning (FL) provides a promising 
solution in addressing resource limitation and privacy concerns, by enabling collaborative 
learning across local institutions (i.e., clients) without data sharing. While the data het- 
erogeneity issues have been extensively studied in recent FL literature, cross-institutional 
brain network analysis presents unique data heterogeneity challenges, that is, the inconsis- 
tent ROI parcellation systems and varying predictive neural circuitry patterns across local 
neuroimaging studies. To this end, we propose FEDBRAIN, a GNN-based personalized FL 
framework that takes into account the unique properties of brain network data. Specifically, 
we present a federated atlas mapping mechanism to overcome the feature and structure 
heterogeneity of brain networks arising from different ROI atlas systems, and a clustering 
approach guided by clinical prior knowledge to address varying predictive neural circuitry 
patterns regarding different patient groups, neuroimaging modalities and clinical outcomes. 
Compared to existing FL strategies, our approach demonstrates superior and more consis- 
tent performance, showcasing its strong potential and generalizability in cross-institutional 
connectome-based brain imaging analysis. The implementation is available 


Keywords: Brain Connectome Analysis; Digital Health; Federated Learning 


1. Introduction 


In recent years, research in neuroscience has been driven to unravel the intricacies of the 
human brain and its connection to complex disorders such as bipolar disorder (BP) and 
Autism. Neuroimaging techniques, including fMRI and DTI, have emerged as crucial tools 
for facilitating the diagnosis of various diseases.' These techniques enable the construction 
of brain networks, which are essentially weighted connected graphs, where nodes represent
anatomical regions of interest (ROIs) and edges represent their functional correlations or
structural connections. By analyzing these networks, researchers gain valuable insights into
the biological structures and functions of complex neural systems, aiding in the early detection
of neurological disorders and advancing fundamental neuroscience research.


© 2023 The Authors. Open Access chapter published by World Scientific Publishing Company and
distributed under the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC)
4.0 License.

Graph Neural Networks (GNNs) have gained significant popularity in analyzing graph- 
structured data, demonstrating impressive performance across various domains like social 
networks, recommender systems, and gene/protein interactions.?’ In neuroscience, GNNs have 
been applied to brain network analysis, addressing tasks such as disease prediction and neural 
pattern discovery? However, deep learning models, including GNNs, heavily rely on large 
labeled datasets to obtain strong performance. Unfortunately, neuroimaging datasets are often 
relatively small due to the high complexity of data acquisition, preprocessing, and annotation, 
leading to significant model overfitting and limited generalization power.!%11 For instance, the 
popular datasets for BP and HIV analysis consist of only a few dozen subjects,!13 making it 
particularly challenging for GNNs to effectively capture important neural circuitry patterns 
from the noisy networks. While there exist several relatively large multi-site neuroimaging
studies, these are still small compared to datasets in typical ML domains.'4

Recently, federated learning (FL) has emerged as a promising solution to address the chal- 
lenges of limited training data and computation resources in local studies.!°-'” FL operates by 
collaboratively training a centralized server model based on data privately stored by multiple 
local clients. The approach offers two notable advantages. First, it ensures privacy preserva- 
tion since clients solely communicate model parameters with the server. Second, it facilitates 
knowledge generalization through client aggregation, which can mitigate the overfitting issues
typically associated with learning on small datasets. These aspects have contributed to the success
of FL in various fields including healthcare applications!’ and graph learning.!? 

One significant challenge in FL is data heterogeneity, wherein the data distributions sig- 
nificantly differ across local data owners. Several FL algorithms!®!” have been proposed to 
tackle the data heterogeneity challenge. However, these methods mostly focus on label distri- 
butions and fail to address the unique data heterogeneity scenarios in cross-institutional brain 
network analysis which can manifest in two key aspects. First, since network parcellation is 
traditionally an ad hoc process carried out by domain experts, it is difficult to assume or 
require all different institutions to conform to the same ROI atlas mapping systems when 
preprocessing their neuroimaging data. As a result, this leads to misalignment in network 
structures and ROI features across clients. Second, different institutions collect brain network 
data for different patient groups, with different neuroimaging techniques and towards different 
clinical purposes, which results in varying underlying predictive neural circuitry patterns. 

In this work, we propose FEDBRAIN, a personalized FL framework designed for GNN- 
based brain network analysis. Our framework comprises three key components: a GNN-based 
FL backbone, a federated atlas mapping mechanism, and a guided client clustering mecha- 
nism. To build our FL platform, we use the well-established FedAvg as a foundation, and our 
default GNN structure is an optimized GCN model.4 To address the feature- and structure- 
wise heterogeneity issue due to potentially different atlas mapping systems used across local 
institutions, we introduce an autoencoder-based atlas mapping mechanism, which aims to 
project diverse ROI profiles onto a uniform sharable embedding space. To handle heterogeneous
predictive neural circuitry patterns due to various neuroimaging modalities and clinical
outcomes, we design a knowledge-guided client clustering mechanism by incorporating prior 
clinical knowledge into the dynamic clustering process of clients with similar data during FL. 
To showcase the effectiveness of FEDBRAIN on real-world datasets from different institu- 
tions, we conduct extensive empirical evaluations, comparing our framework to state-of-the-art 
methods. The results demonstrate that FEDBRAIN outperforms the baselines across all clients, 
with a minimum relative gain of 21.36% in accuracy. Moreover, we conduct ablation studies 
and specific analyses on the proposed federated atlas mapping and guided clustering mecha- 
nisms to fully understand their contribution and robustness within the framework. The results 
confirmed the necessity of these components in improving overall model performance. 


2. Related Work 


GNNs for Brain Network Analysis. GNNs have gained significant attention for their
effectiveness in analyzing graph-structured data,?° 2? with several pioneering models applied to
brain network analysis. Notable examples include BrainGNN,*® which uses ROI-aware graph
convolutional and ROI-selection pooling layers to predict neurological biomarkers from fMRI 
data. Another approach, BrainNetCNN,? adopts a CNN framework with various convolutional 
filters designed to leverage the topological locality of structural brain networks. BrainNetTF’ 
introduces a transformer architecture with an orthonormal clustering readout that considers 
ROI similarity within functional modules. Existing studies®?? *° have demonstrated GNNs can 
substantially improve performance in brain disorder predictions when sufficient data is avail- 
able. However, the difficulty emerges when dealing with limited training samples in practical 
scenarios, especially for particular clinical studies.?© This limitation hinders the full potential 
of GNNs for modeling brain network data, motivating designs capable of overcoming data 
scarcity and heterogeneity and improving performance in real clinical tasks. 


FL on Graphs. FL has gained significant attention for collaboratively training deep learning 
models while preserving data privacy. Recently, it has been proven to be effective in the 
context of graphs. Some of the pioneering works have explored modeling clients as nodes in 
graphs,?”?8 and benchmark surveys”? have contributed to the understanding of GNN-based FL 
across graphs in diverse data domains. FL on graphs can face a unique challenge, graph data 
heterogeneity. Some previous related works include FedCG?* which addresses the challenge 
of statistical heterogeneity in FL by leveraging GNN models to extract interactions across 
domains; GCFL*° which studies the specific graph-level heterogeneity across domains and 
proposes a dynamic clustered graph FL framework; and FedLit?! which proposes a way to 
dynamically cluster the latent link types of graphs in FL to address the link-level heterogeneity 
across graphs. Nonetheless, the distinct ways in which heterogeneity manifests in brain network 
studies, such as the variance in parcellation systems and neural circuitry patterns, make most 
FL frameworks that emphasize generic graph structure learning inapplicable. While research 
on GNN-based FL for neuroimaging data has shown promise, existing techniques focus on 
privacy preservation’? or domain adaptation.*? These objectives inherently diverge from our 
approach, which aspires to bolster data alignment and augment client personalization.


3. The FedBrain Framework


3.1. The FL Backbone


The backbone FL structure of FEDBRAIN is based on federated averaging (FedAvg).'° The 
essence is to aggregate the updated model parameters from local clients through a process of 
weighted averaging. These averaged parameters are then disseminated back to each client in 
the subsequent communication round. Specifically, when aggregating parameters, the server 
assigns a weight to each client in proportion to their respective sample size. 
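
The aggregation rule described above can be illustrated with a short sketch. The dictionary-of-arrays parameter layout and the function name `fedavg` are assumptions made for illustration, not the FedBrain implementation; the two sample sizes are borrowed from the dataset table purely as example weights.

```python
import numpy as np

def fedavg(client_params, client_sizes):
    """Sample-size-weighted average of client parameter dicts (name -> array)."""
    total = float(sum(client_sizes))
    weights = [n / total for n in client_sizes]       # weight proportional to sample size
    return {
        name: sum(w * p[name] for w, p in zip(weights, client_params))
        for name in client_params[0]
    }

# Two toy clients holding one parameter matrix each.
clients = [{"W": np.ones((2, 2))}, {"W": 3.0 * np.ones((2, 2))}]
global_params = fedavg(clients, client_sizes=[97, 70])
```

In each communication round the server would send `global_params` back to every client as the starting point for the next round of local training.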

We utilize an optimized GCN‘ as backbone for both the server and client models. The 
ROI (i.e., node) features are initialized with the connection profiles (i.e., adjacency).* That 
is, the feature matrix $X$ is equivalent to the adjacency $A$ ($X = A$), where $A$ is parameterized
by the node set $\mathcal{V} = \{v_i\}_{i=1}^{N}$ and the weighted edge set $\mathcal{E} = \mathcal{V} \times \mathcal{V}$.


3.2. Federated Atlas Mapping 


Motivation. For brain network data, the ROI (i.e., node) parcellation is determined by the
brain atlas. Once a template is chosen, all brain networks within a dataset share the same
ROI identities. However, in our cross-institutional setting, different institutions may utilize
different parcellation systems. This leads to heterogeneity in both sizes and structures of the
parcellated networks, as well as divergent meanings of ROI features (i.e., connectivity profiles).
While it is possible to manually convert between atlases, this process is laborious and requires 
extensive domain expertise. Therefore, we propose a data-driven transformation, as a pre- 
processing mechanism, that aims to align network features and structures across institutions, 
ensuring consistency in network dimensions and physical interpretations of features. 


Autoencoder framework. To achieve uniform feature dimensions and network sizes, we 
employ a one-layer linear autoencoder (AE) to learn a dataset-specific projection. Given a 
target dimension $M$ that is consistent across all datasets and an input feature $X \in \mathbb{R}^{N \times N}$
($N > M$), the objective is to learn a linear projection $W \in \mathbb{R}^{N \times M}$, such that the projected
representations preserve as much information as possible from the original features. The AE
is optimized using the mean-squared-error (MSE) reconstruction objective, denoted as $\mathcal{L}_{rec} =
(1/N)\|X - XWW^\top\|^2$. Intuitively, the projection $W$ transforms initial features by applying
a weighted linear combination on the original dimensions. Consequently, the columns of $W$
learn to assign original dimensions into $M$ groups. We exploit this concept to condense
the network structure. To reduce the computational complexity, we formulate an assignment
matrix $Z \in \mathbb{R}^{N \times M}$ such that $Z_{ij} = \mathbb{1}[W_{ij} \in \operatorname{arg\,top}\,k\,(\mathrm{col}_j(W))]$. The matrix $Z$ records the
top-$k$ greatest entries in each column of $W$ and zeros out the rest. Ultimately, given a graph
adjacency matrix $A$ ($= X$), we construct a compressed network $A'$ by evaluating $A' = Z^\top A Z$.
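
The projection and network-compression steps can be sketched as follows. As a labeled substitution, the sketch does not train the autoencoder by gradient descent: it uses the closed-form minimizer of the tied-weight linear reconstruction objective over orthonormal projections, namely the top $M$ right singular vectors of the input. The function names, the toy connectivity matrix, and the sizes are hypothetical.

```python
import numpy as np

def linear_ae_projection(X, M):
    """Closed-form stand-in: top-M right singular vectors minimize ||X - X W W^T||^2
    over projections W with orthonormal columns."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:M].T                               # W with shape (N, M)

def topk_assignment(W, k):
    """Z[i, j] = 1 if W[i, j] is among the k largest entries of column j, else 0."""
    Z = np.zeros_like(W)
    top_rows = np.argsort(-W, axis=0)[:k]         # (k, M) row indices per column
    Z[top_rows, np.arange(W.shape[1])] = 1.0
    return Z

rng = np.random.default_rng(0)
N, M, k = 12, 4, 3
A = rng.random((N, N))
A = (A + A.T) / 2                                 # toy symmetric connectivity matrix
W = linear_ae_projection(A, M)                    # features X = A, as in the text
Z = topk_assignment(W, k)
A_small = Z.T @ A @ Z                             # compressed (M, M) network
```

Because `Z` is binary, each entry of `A_small` sums the original edge weights between the two groups of ROIs selected by the corresponding pair of columns.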


Federated training. Apart from dataset-specific projections, aligning the physical inter- 
pretations of projected features across datasets is equally vital to mitigate structure- and 
feature-level heterogeneity. To achieve this, we leverage the FL approach to train the autoen- 
coders with the intention of obtaining a global atlas projection. However, the architectural 
sizes of autoencoders across clients can vary due to the differing original data dimensions,
which makes it challenging to communicate model parameters.

To address this issue, we propose a unified mapping method that aims to adapt the size
of the global model to the varying dimensionality of each local dataset. Given a global
projection $W_G \in \mathbb{R}^{N_G \times M}$ based on the most detailed parcellation template with $N_G$ defined
ROIs, and a coarser template with $N_L$ defined ROIs ($N_L < N_G$) employed for local data,
our goal is to derive an assignment matrix $P_L \in \mathbb{R}^{N_L \times N_G}$, which ensures the local projection
$W_L \in \mathbb{R}^{N_L \times M}$ is distributed through the mapping $W_L = P_L W_G$. To achieve this, we leverage
the 3D coordinates of the ROIs, denoted as $D_G \in \mathbb{R}^{N_G \times 3}$ for the global parcellation template
and $D_L \in \mathbb{R}^{N_L \times 3}$ for the local template. We first calculate a distance matrix $S \in \mathbb{R}^{N_L \times N_G}$,
where $S_{ij} = d(\mathrm{row}_i(D_L), \mathrm{row}_j(D_G))$ represents the pairwise Euclidean distance between ROIs
from the two templates. We then designate $(P_L)_{ij} = \mathbb{1}[S_{ij} = \min(\mathrm{col}_j(S))]$. This implies that
we only consider the minimum entry in each column of $S$. Essentially, we enable $P_L$ to learn
a mapping that groups ROIs in the global template with those in the local template, based
on their spatial proximity. During each communication round, clients start by downloading
the server’s parameters by applying the mapping $W_L = P_L W_G$. Subsequently, each client sends
its updated parameters back to the server, employing the inverse mapping $W_G = P_L^\top W_L$.
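
A small sketch of the coordinate-based assignment and the two mappings is given below. The random coordinates, matrix sizes, and the function name `nearest_roi_assignment` are hypothetical; the sketch only shows how a nearest-ROI indicator matrix can resize a projection between a fine global template and a coarser local one.

```python
import numpy as np

def nearest_roi_assignment(D_L, D_G):
    """P[i, j] = 1 if local ROI i is the nearest local ROI to global ROI j, else 0."""
    S = np.linalg.norm(D_L[:, None, :] - D_G[None, :, :], axis=2)   # (N_L, N_G) distances
    P = np.zeros_like(S)
    P[S.argmin(axis=0), np.arange(S.shape[1])] = 1.0   # keep the column-wise minimum
    return P

rng = np.random.default_rng(0)
N_G, N_L, M = 10, 4, 3
D_G = rng.random((N_G, 3))                 # toy 3D coordinates, global template
D_L = rng.random((N_L, 3))                 # toy 3D coordinates, local template
P_L = nearest_roi_assignment(D_L, D_G)
W_G = rng.normal(size=(N_G, M))            # global projection held by the server
W_L = P_L @ W_G                            # download: resize to the local template
W_G_back = P_L.T @ W_L                     # upload: map back to the global size
```

Each global ROI is assigned to exactly one local ROI, so every column of `P_L` contains a single one, while a local ROI may collect several global ROIs.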


3.3. Guided Clustering 


Motivation. Beyond the discrepancies in network parcellation systems, another significant 
source of heterogeneity originates from the variability in predictive neural circuitry patterns, 
encompassing data modalities and clinical outcomes. These variances can result in a subop- 
timal adaptation of the generalized global model to specific local objectives. Therefore, our 
aim is to strike a balance between global generalization and local personalization. Moreover, 
as shown in Table 1, we notice that similar neural patterns are shared among certain client
institution subgroups. This motivates us to integrate client clustering??:4 into the FL process. 


Clustered FL. When data distributions are similar among local clients, the average global 
model can achieve convergence for all local objectives. However, in instances of heterogeneity, 
the global model fails to adapt to local optimizations, resulting in stationary point conver- 
gence.*4 To mitigate stationary convergence, clients can be assigned to clusters with homoge- 
neous data distributions, thereby initiating cluster-specific FL subroutines. 


Constrained clustering. While gradient-based clustering effectively addresses the station- 
ary point issue and improves performance over the basic FedAvg, the method is entirely data- 
driven, lacking consideration of shared clinical prior knowledge related to the neural circuitry 
patterns of each client. Consequently, heterogeneity may still exist within the formed clus- 
ters, necessitating further division of clusters. This often leads to the creation of singleton 
clusters, undermining the essence of collaborative learning. This phenomenon is demonstrated 
in Figure 2 (Section 4.4). Based on these observations, we propose a refined variant of the
clustering method that incorporates shared prior knowledge to guide the clustering process. 
For instance, in terms of data modalities, it is intuitive to group clients with similar ROI 
connectivities and MRI data. Likewise, with regard to clinical outcomes, FL on a cluster level 
could benefit from learning similar objectives. To this end, we create must-links between pairs
of clients that exhibit highly similar neural patterns and define cannot-links for those that
do not. We introduce a weighted reward term λ_must and a penalty term λ_cannot, which are
multiplied with the pairwise client similarity measure when creating must- and cannot-links.
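
As a rough illustration of how such constraints could enter a similarity-based clustering step, the sketch below rescales a pairwise client similarity matrix: must-link pairs are multiplied by a reward factor and cannot-link pairs by a penalty factor. The function name, the numeric weights, and the toy similarity values are assumptions made for illustration; the text does not specify them.

```python
def apply_link_constraints(similarity, must_links, cannot_links,
                           reward=1.5, penalty=0.5):
    """Return a copy of a symmetric similarity matrix with link weights applied.

    reward (> 1) scales up must-link pairs and penalty (< 1) scales down
    cannot-link pairs. Both default values are illustrative assumptions.
    """
    guided = [row[:] for row in similarity]  # copy so the input is untouched
    for i, j in must_links:
        guided[i][j] *= reward
        guided[j][i] *= reward
    for i, j in cannot_links:
        guided[i][j] *= penalty
        guided[j][i] *= penalty
    return guided


# Toy example with three hypothetical clients.
similarity = [
    [1.0, 0.6, 0.5],
    [0.6, 1.0, 0.4],
    [0.5, 0.4, 1.0],
]
guided = apply_link_constraints(similarity,
                                must_links=[(0, 1)],
                                cannot_links=[(0, 2)])
```

The rescaled matrix can then be passed to any similarity-based clustering routine, which becomes more likely to merge must-link pairs and to keep cannot-link pairs apart.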


4. Experiments 


Datasets. We evaluate our framework using six real-world brain network datasets: BP, HIV,
PPMI, PNC, ABIDE, and ABCD. We present key statistics for each dataset in Table 1. Among
them, BP, HIV, and PPMI contain multiple data modalities. In light of this, we treat each
such modality as a separate FL client. Based on the available label information, we define two
possible tasks, disease prediction (i.e., patients vs. healthy controls) and gender prediction,
both in the form of binary classification.


Table 1. Dataset statistics. 


Dataset  Modality          Sample Size  Atlas                Network Size  Outcome  Class Number
BP       fMRI, DTI         97           Brodmann 82          82 x 82       Disease  2
HIV      fMRI, DTI         70           AAL 90               90 x 90       Disease  2
PPMI     PICo, Hough, FSL  754          Desikan-Killiany 84  84 x 84       Disease  2
PNC      fMRI              503          Power 264            264 x 264     Gender   2
ABIDE    fMRI              1009         Craddock 200         200 x 200     Disease  2
ABCD     fMRI              7901         HCP 360              360 x 360     Gender   2


Parameter setup. The downstream classifier consists of a single-layer MLP, and we use the 
negative log-likelihood measure as the optimization objective and accuracy as the evaluation 
metric. In the case of all FL baselines, a complete training procedure encompasses 80 com- 
munication rounds. For the self-train (i.e., non-FL) baseline, each local model is trained 
for 80 epochs. Regarding FEDBRAIN, we retain the top 3 entries in each column of the atlas 
mapping projection matrix for network transformation, and use the most detailed HCP 360 
template to define the global model for our federated training of AEs. 
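
The column-wise sparsification of the projection matrix can be sketched as follows. This is a minimal illustration that assumes "top" means largest value; the function name and the toy matrix are hypothetical.

```python
def keep_top_k_per_column(matrix, k=3):
    """Zero out all but the k largest entries in each column of a 2-D list."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    sparse = [[0.0] * n_cols for _ in range(n_rows)]
    for c in range(n_cols):
        # Row indices of the k largest values in column c.
        top_rows = sorted(range(n_rows), key=lambda r: matrix[r][c],
                          reverse=True)[:k]
        for r in top_rows:
            sparse[r][c] = matrix[r][c]
    return sparse


# Toy projection matrix: 4 source ROIs (rows) mapped onto 2 target ROIs (columns).
projection = [
    [0.1, 0.7],
    [0.5, 0.2],
    [0.3, 0.05],
    [0.9, 0.4],
]
sparse_projection = keep_top_k_per_column(projection, k=3)
```

Keeping only a few entries per column means each target ROI is expressed through a small number of source ROIs.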


Empirical analyses. The following sections are structured to assess (1) the performance of 
FEDBRAIN in comparison to widely adopted FL frameworks, and (2) the contribution of the 
key components to the overall performance, supplemented by case studies. 


4.1. Overall performance comparison (RQ1) 


We present a comprehensive performance comparison in Table 2. We include the client (i.e.,
dataset) name, along with its modality name if it contains multiple; the average accuracy for
each client; the combined accuracy across all clients; and the minimum client-wise gain over the
self-train baseline. To ensure fair comparisons, we apply the same GNN architecture and 
parameter setup to all methods. Our analysis reveals several key observations. 

Firstly, FL baselines show significant improvement over self-train, with an average relative
gain of 15.34% across all clients. Notably, clients with smaller sample sizes, like BP, HIV, and
PNC, experience the most substantial performance enhancement, with an average relative gain of
19.31%. This highlights the valuable effect of collaborative learning and cross-institutional
knowledge generalization in overcoming model overfitting on limited training resources.
Moreover, FL training also results in slight performance improvements on larger datasets, such
as PPMI, ABIDE, and ABCD, underscoring the positive impact of a global optimization scheme in
enhancing local performance. However, it is worth noting that among the chosen FL baselines,
there is a slightly increased performance variance across clients, mainly due to underlying
heterogeneity arising from the unique characteristics of brain network data.


Table 2. Performance for each client is averaged from 10-fold cross-validation; the combined
performance is averaged across all clients. We highlight the best in bold and the runner-up underlined.

Accuracy (average)
Clients     BP-fMRI         BP-DTI          HIV-fMRI        HIV-DTI         PPMI-PICo
self-train  0.5463(±0.019)  0.5012(±0.082)  0.5286(±0.035)  0.4571(±0.140)  0.6394(±0.034)
FedAvg      0.6037(±0.073)  0.5158(±0.013)  0.5457(±0.153)  0.5000(±0.078)  0.7925(±0.002)
FedProx     0.6084(±0.117)  0.5853(±0.085)  0.6200(±0.132)  0.6029(±0.097)  0.7925(±0.002)
SCAFFOLD    0.5800(±0.120)  0.6400(±0.049)  0.6343(±0.070)  0.6629(±0.057)  0.7778(±0.000)
FEDBRAIN    0.7389(±0.066)  0.7500(±0.077)  0.7857(±0.071)  0.8143(±0.070)  0.8102(±0.010)

Clients     PPMI-Hough      PPMI-FSL        PNC             ABIDE           ABCD            combine         min gain
self-train  0.6570(±0.054)  0.6852(±0.041)  0.5034(±0.052)  0.5025(±0.007)  0.5342(±0.002)  0.5555(±0.073)  -
FedAvg      0.7633(±0.031)  0.7925(±0.002)  0.5434(±0.008)  0.5044(±0.012)  0.5167(±0.017)  0.6078(±0.118)  -0.032
FedProx     0.7536(±0.037)  0.7925(±0.002)  0.6057(±0.018)  0.5594(±0.003)  0.5700(±0.020)  0.6490(±0.088)  0.067
SCAFFOLD    0.7944(±0.014)  0.7889(±0.014)  0.6015(±0.009)  0.5765(±0.090)  0.5980(±0.045)  0.6654(±0.084)  0.120
FEDBRAIN    0.8102(±0.010)  0.8095(±0.010)  0.7275(±0.044)  0.6549(±0.034)  0.7033(±0.033)  0.7605(±0.052)  0.214

Secondly, among all the selected FL baselines, SCAFFOLD stands out as the top performer, 
exhibiting an impressive average gain of 5.89% over its competitors. This result highlights 
the robustness of SCAFFOLD in addressing client heterogeneity through controlled gradient 
correction. Additionally, both SCAFFOLD and FedProx, which is also capable of handling data and
system heterogeneity, show reduced performance variance compared to FedAvg. This further
aligns with our motivation to develop a specialized solution for reducing brain network-specific
heterogeneity, which aims to unleash the full potential of collaborative learning, reflected
in enhanced and more consistent performance across multiple datasets.

Lastly, FEDBRAIN outperforms SCAFFOLD by a relative margin of 14.29%, while also sig- 
nificantly reducing performance variance across clients, indicating the value of tailoring FL 
approaches to consider the unique properties and characteristics of brain network data.
Moreover, FEDBRAIN demonstrates statistically significant improvements over the compared
baselines, as validated by paired t-tests against each method at the 0.05 significance level.
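
The relative gains quoted in this section appear to follow from the combined accuracies in Table 2 under the usual definition (new - baseline) / baseline. The short check below reproduces three of them under that assumption. How the averaging was actually done is not stated, so taking the mean over the FL baselines is a guess that happens to match the reported figures.

```python
def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`."""
    return (new - baseline) / baseline


# Combined accuracies as reported in Table 2.
combined = {
    "self-train": 0.5555,
    "FedAvg": 0.6078,
    "FedProx": 0.6490,
    "SCAFFOLD": 0.6654,
    "FEDBRAIN": 0.7605,
}

# Mean of the three FL baselines versus self-train (reported: 15.34%).
fl_mean = (combined["FedAvg"] + combined["FedProx"] + combined["SCAFFOLD"]) / 3
gain_fl = relative_gain(fl_mean, combined["self-train"])

# SCAFFOLD versus the mean of the other two FL baselines (reported: 5.89%).
others_mean = (combined["FedAvg"] + combined["FedProx"]) / 2
gain_scaffold = relative_gain(combined["SCAFFOLD"], others_mean)

# FEDBRAIN versus SCAFFOLD (reported: 14.29%).
gain_fedbrain = relative_gain(combined["FEDBRAIN"], combined["SCAFFOLD"])

for name, gain in [("FL vs self-train", gain_fl),
                   ("SCAFFOLD vs other FL", gain_scaffold),
                   ("FEDBRAIN vs SCAFFOLD", gain_fedbrain)]:
    print(f"{name}: {gain * 100:.2f}%")
```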


Table 3. Atlas mapping comparisons.

Accuracy                 average         min gain
No Atlas Mapping         0.6845(±0.068)  -
Atlas Mapping            0.7246(±0.063)  0.0039
Federated Atlas Mapping  0.7605(±0.052)  0.0214


Table 4. Guided clustering comparisons.

Accuracy               average         min gain
No Clustering          0.6921(±0.071)  -
Non-guided Clustering  0.7231(±0.065)  0.0000
Guided Clustering      0.7605(±0.052)  0.0000


4.2. Ablation studies (RQ2) 


We analyze the two key components of FEDBRAIN: federated atlas mapping and guided clus- 
tering. To highlight the contribution of each, we keep the best configuration of one component 
fixed while evaluating the other. The results are presented in Table 3 and Table 4, where
we present an averaged performance across all clients. Regarding the analysis for atlas
mapping, we investigate its impact on overall performance both without the entire module and
without federated training. When atlas mapping is not applied, we add a learnable linear 
projection head to the client’s GNN model that is excluded from the FL process. In general, 
we make two main observations: (1) Ensuring consistency in feature and network dimensions
yields a relative gain of 6.12% compared to the uncompressed baseline. (2) Aligning the
physical meanings of projected features further boosts performance by 4.95%, showcasing its 
effectiveness in countering incongruous ROI parcellation systems. 

Regarding client clustering, we compare two scenarios: without clustering and without 
shared prior knowledge guidance. Our key observations are as follows: (1) Personalizing client 
optimization through similarity-based clustering leads to a significant enhancement in down- 
stream performance, with a relative margin of 4.48%. (2) By integrating clinical prior knowl- 
edge and constraints, we further enhance cluster-specific learning and knowledge generaliza- 
tion, resulting in a relative gain of 5.17% and a reduction in performance variance. 


4.3. Heterogeneity analysis of federated atlas mapping (RQ3)


Fig. 1. Pairwise structure- (upper) and feature-level (lower) heterogeneity measures across all
datasets compared on brain networks processed without atlas mapping (left), with atlas mapping but
without federated training (mid), and full federated atlas mapping (right). The smaller the numeric
measure, the less heterogeneity exists within the investigated pair.


To validate the contribution of the proposed federated atlas mapping in reducing structure-
and feature-level heterogeneity, we employ two distinct quantitative metrics to evaluate the
averaged heterogeneity measure among brain networks across every pair of datasets. Firstly,
regarding structure-level heterogeneity, we leverage the Anonymous Walk Embeddings (AWEs)
technique to generate representations for each brain network graph. We then calculate the
Jensen-Shannon distance between every pair of AWE representations. Secondly, regarding
feature-level heterogeneity, we analyze the empirical distribution of feature similarity between
all pairs of linked nodes (ROIs) present in each graph. We then compute the Jensen-Shannon
divergence between each pair of these distributions. We present our findings in Figure 1.
Specifically, we compare the heterogeneity measures among brain networks and features processed
under three scenarios: without federated atlas mapping, with atlas mapping but without
federated training, and with full federated atlas mapping. Our observations suggest that atlas
mapping along with federated training significantly reduces the level of heterogeneity across
datasets in both network structures and ROI features.
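
For reference, the Jensen-Shannon divergence and its square root, the Jensen-Shannon distance, can be computed between two discrete distributions as in the sketch below. It assumes base-2 logarithms, which bound the divergence in [0, 1], and takes already-normalized histograms as input. The embedding and histogram construction steps are not shown, and the example histograms are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    value = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return max(0.0, value)  # guard against tiny negative rounding errors


def js_distance(p, q):
    """Jensen-Shannon distance: the square root of the divergence."""
    return math.sqrt(js_divergence(p, q))


# Hypothetical feature-similarity histograms for two datasets.
hist_a = [0.1, 0.4, 0.5]
hist_b = [0.3, 0.3, 0.4]
print(js_divergence(hist_a, hist_b), js_distance(hist_a, hist_b))
```

Both measures are symmetric, equal zero for identical distributions, and reach one for distributions with disjoint support.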

In addition, we investigate the individual influence of the transformed network structure
and ROI features on downstream performance. The summarized results can be found in Table 5.
We observe that learning from either transformed network structures or ROI features
leads to an average relative gain of 4.68% over the non-transformation baseline. The best
performance is achieved when learning from both transformed structures and features, further
validating the robustness of our design in reducing heterogeneity and enhancing task-wise
performance simultaneously. Furthermore, we observe a significant reduction in time
complexity when learning on transformed data. Given the original network and feature dimension
N, a transformed dimension M (M < N), and a hidden size F of the l-layer GNN model,
the bounded complexity reduces from O(l(N²F + NF²)) to O(l(M²F + MF²)). In actual FL training
with 80 communication rounds, the transformation reduces the completion time from roughly
612 seconds to 266 seconds.
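
To give a feel for the size of this reduction, the sketch below evaluates the two bounds for one set of dimensions. The values of N, M, F and the layer count are hypothetical and chosen only for illustration; the ratio simplifies to M(M + F) / (N(N + F)), independent of the number of layers.

```python
def gnn_cost(n, f, layers):
    """Operation count layers * (n^2 * f + n * f^2) taken from the stated bound."""
    return layers * (n ** 2 * f + n * f ** 2)


# Hypothetical sizes, not taken from the paper.
N, M, F, LAYERS = 360, 100, 64, 2

ratio = gnn_cost(M, F, LAYERS) / gnn_cost(N, F, LAYERS)
closed_form = (M * (M + F)) / (N * (N + F))  # same ratio after cancelling layers * F

# Speed-up implied by the reported completion times (612 s versus 266 s).
reported_speedup = 612 / 266

print(f"cost ratio: {ratio:.3f}, reported speed-up: {reported_speedup:.2f}x")
```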


Table 5. Network transformation comparisons.

Transformation       average         min gain
None                 0.6845(±0.068)  -
Structure            0.7042(±0.070)  -0.0126
Feature              0.7288(±0.060)  0.0357
Structure & Feature  0.7605(±0.052)  0.0417


Table 6. Cluster constraints comparisons.

Link           average         min gain
None           0.7231(±0.065)  -
Cannot         0.7337(±0.061)  0.0089
Must           0.7445(±0.057)  0.0148
Cannot & Must  0.7605(±0.052)  0.0235


4.4. Clustering analysis of guided clustering (RQ4) 


We investigate the impact of the guided clustering approach on cluster formation. We focus
on evaluating the effectiveness of this mechanism in grouping institutions (i.e., clients) with
similar neural circuitry patterns while also maintaining reasonable cluster sizes. We compare
the outcomes with those obtained from standard hierarchical clustering. We show a dendrogram
visualization of the cluster results in Figure 2. Specifically, the linked branches depict
the hierarchical relationships, with blue-colored lines representing singleton clusters, and other
colors highlighting cluster assignments. Our observations indicate that incorporating clinical
prior knowledge guidance substantially enhances the capability to identify and group clients
with similar or near-identical neural circuitry patterns. Our approach also avoids the
production of singleton clusters, which were prominent when using the standard method.


Fig. 2. Dendrogram visualization of cluster results from standard hierarchical clustering (left) and
prior knowledge guided clustering (right). We list the client names alongside their clinical outcomes
(e.g., disease/gender) and data modalities (e.g., functional/structural connectivities).

Moreover, we study the impact on downstream performance when using clustering guidance 
that exclusively relies on either must- or cannot-link information. The results are presented 
in Table 6. We observe that cannot-link constraints alone lead to a relative gain of 1.47% over
standard clustering. When guided by must-links alone, we achieve a further improvement of 
1.53%, bringing the performance to within a mere 2.10% difference from considering both 
constraints. The findings suggest that must-link information plays a slightly more influential 
role in identifying similar neural circuitry patterns. On the other hand, cannot-link informa- 
tion proves valuable in averting additional intra-cluster heterogeneity, thereby reducing the 
likelihood of further cluster division and the formation of singleton clusters. 


5. Conclusion 


Cross-institutional brain network analysis has been a challenging task for conventional FL 
frameworks and GNN models. The presence of unique data heterogeneity, particularly in 
terms of inconsistent ROI parcellation systems and predictive neural circuitry patterns, poses 
a significant obstacle to effective collaborative training and knowledge generalization. To tackle 
these challenges, we propose FEDBRAIN, a personalized GNN-based FL framework. Specifi- 
cally, we leverage a data-driven atlas mapping mechanism to address the issue of incompatible 
ROI parcellation systems. Moreover, we incorporate clustered FL to enhance client personal- 
ization and integrate clinical prior knowledge to guide the clustering process. We conducted 
extensive experiments on multiple real-world brain network studies, demonstrating the supe- 
rior performance of FEDBRAIN compared to various state-of-the-art FL baselines. 

We direct our future efforts to enhance FEDBRAIN by addressing current limitations.
Firstly, we will expand data considerations to include a wider array of atlas templates, clinical
tasks, and clients with multi-modal data. Secondly, we will optimize computational efficiency as
the framework becomes more sophisticated. Thirdly, we will delve into theoretical investigations
to ensure strong privacy guarantees. Lastly, we plan to broaden empirical investigations by
incorporating a broader set of data to validate the framework’s robustness.


1. Drug development and repurposing


Drug discovery and development is a long and high-risk process with cumulative annual costs 
approaching US $1 billion (Hinkson et al. 2020; Wouters et al. 2020), and over 85% of
drug candidates will fail prior to completing clinical trials (Nielsch et al. 2016). Drug repurposing 
or repositioning of existing medications for new therapeutic uses can substantially reduce costs, 
time, and effort while providing additional treatment options to patients. 


The increasing availability of large-scale electronic medical record (EMR) data, in combination
with genomic, proteomic, molecular, and other biomedical data is enabling more cost-effective 
investigations of treatment response, adverse event profiling, and novel target identification. The 
use of “real-world data” presents a promising solution with the potential to dramatically reduce 
development time and cost. Furthermore, policy makers in the US and other countries are 
increasingly open to considering alternative sources of evidence beyond clinical trials in their 
decision-making processes. For instance, the 21st Century Cures Act encourages the use of real- 
world data to generate evidence of product effectiveness to help support approval of new indications 
for existing drugs (Dagenais et al. 2022). 


Traditionally these sources encompass data from hospital or population-based health records, third- 
party health insurance claims, registries, and health surveys (Administration 2018). These data types 
are increasingly linked to novel types of biomedical data, such as genomic (or other “omic”) data 
from large biobanks, biopsies, pathology tests, diagnostic imaging, and information related to social 
determinants of health (SDoH). Initiatives focused on data-driven drug repurposing, leveraging the 
expansive resources available within real-world data repositories, have the potential to improve the
efficiency of identifying potential treatments, while simultaneously reducing possible risks 
associated with drug development. 


Computational approaches, such as machine learning, offer a powerful avenue to address specific 
challenges in drug development by harnessing the wealth of multi-dimensional data from various 
sources. For example, prior research has demonstrated the ability of machine learning algorithms to 
scan compound libraries to optimize the design of small molecules and evaluate molecular docking 
to estimate drug-target interactions, and use this to find repurposing targets for viral infections and 
cancer (Kumar et al. 2015; Mirza et al. 2016; Wang et al. 2017). 


Phenotype-first approaches used in conjunction with machine learning are yet another way of
identifying targets for drug repurposing. This methodology uses medical record history to
identify a set of optimal treatment modalities that have demonstrated increased efficacy for
clinical conditions. This approach capitalizes on the growing availability of EMR data to
evaluate acute and long-term therapeutic response based on individual-level, real-world
clinical data. These deep-learning computational approaches can be coupled with genetics and
advanced -omics data to elucidate the underlying mechanisms of disease. When combined with
available drug-target datasets, this information can facilitate the identification of
alternative treatment strategies (Allen et al. 2015; MacEachern and Forkert 2021; Wang et al.
2021; Xu et al. 2022).


Similarly, novel computational approaches that leverage genomic and transcriptomic 
methodologies, including but not limited to genome-wide association studies, genetically predicted 
gene expression analysis, and Mendelian randomization have the potential to identify and estimate 
the effect of drug repurposing on reducing risk of disease. These approaches are particularly 
appealing, given that drugs with genetic evidence from disease association studies have a two-fold 
higher likelihood of successfully reaching the market (Nelson et al. 2015; King et al. 2019). 


The research team comprising this panel has experience in developing such computational
pipelines for identification of potential drug candidates for repurposing in diabetes treatment
(Khankari et al. 2022; Shuey et al. 2022). This approach uses transcriptome-driven drug
screening to identify candidate therapeutics. Subsequently, it validates these candidates
through a two-step process: 1) generating real-world evidence for drug efficacy using a
self-controlled case series study design in large EMR datasets and quantifying changes in
disease-associated biomarkers before and after treatment with identified candidates, and
2) generating genetic evidence for drug target efficacy for disease using the Mendelian
randomization framework.
We encourage participation in this series by other researchers who are involved in the development 
of strategies to aid in the identification and evaluation of drug repurposing opportunities. 


2. Session contents 


Here we briefly describe the studies that will be presented during the session.


2.1. List of topics captured in this session 


Our session includes presentations on the following diverse topics related to drug repurposing and 
discovery: 

• Modeling of outcome risk based on medication exposure using propensity score matching
• Improved techniques for target identification from sequencing data
• Machine learning modeling of disease and protein interaction networks
• High-throughput functional screening assays


2.2. Systematic Estimation of Treatment Effect on Hospitalization Risk as a Drug 
Repurposing Screening Method 


In this manuscript, Georgantas et al. propose a simple, pragmatic screening approach for drug 
repurposing using real-world data. They incorporated time-to-event and propensity score matching 
with observational data from the UK Biobank to evaluate the roles of thousands of drug-disease 
pairs on hospitalization risk. This elegant use of high-dimensional real-world data suggests 
numerous repurposing opportunities for existing, commonly prescribed medications.


2.3. Transcript-aware analysis of rare predicted loss-of-function variants in the UK
Biobank elucidate new isoform-trait associations


As whole exome and genome sequencing become more widely accessible, the ability to synthesize
these results into meaningful discoveries is essential. Traditional burden testing approaches assume
that all variants in a given gene have similar effects on gene function and fail to consider isoforms,
where this assumption is often violated. Hoffing et al. demonstrate how using transcript-specific
annotations (rather than collapsed gene-based evaluations) to classify rare predicted loss-of-function
(pLOF) mutations can dramatically impact effect estimates for rare variant association analyses.
Their work links such pLOFs to tissue specificity, quantitative endophenotypes, and disease
outcomes, and has clear implications for improving the outputs of such large-scale sequencing data
for drug target identification. The results of this study have the potential to improve the accuracy of
rare variant-disease association studies that often serve to identify novel drug targets.


2.4. Formulating new drug repurposing hypotheses using disease-specific hypergraphs 


In Jain et al., the authors use disease-specific hypergraphs in which hyperedges of various lengths 
encode biological pathways to generate new repurposing targets which may be overlooked by classic 
knowledge graphs. These low-dimensional representations of drug-to-gene pathways are filtered to 
existing therapeutic approaches for Alzheimer’s Disease and then evaluated using the multiscale 
interactome (MSI). Further, the seven targets not represented in MSI were evaluated by literature 
review, with many of these candidates having demonstrable impacts on brain development or 
disease processing that support a relationship with Alzheimer’s Disease. 


2.5. Combined kinome inhibition states are predictive of cancer cell line sensitivity to kinase 
inhibitor combination therapies 


Kinase inhibitors are a staple in clinical oncology; however, monotherapy may lead to resistance in 
part due to compensation by other members of the kinase network or kinome. Combinatorial 
therapies have been suggested to combat this resistance. However, determining the best combination 
of kinase inhibitors is essential. To this end, Joisa et al. developed a high-throughput platform for 
evaluating combinatorial effects of multiple kinase inhibitors. By leveraging heterogeneous data for
the prediction of potential drug combination targets, the authors identified the combination of MEK
and PI3K inhibitors (Trametinib/Omipalisib). Their results are supported by this particular
combination of inhibitors entering a recent phase I clinical trial, which suggests the potential for
this method to identify other combinatorial therapies.


2.6. The Human Protein Structure Targetome 


Ovanessians et al. applied structure-based modeling to more than 20,000 human proteins curated
from various protein databases to build a human “targetome”. This approach was developed to
prioritize protein-ligand pairs and accounts for the complexities of both protein structure and
binding site affinities to prioritize drug targeting. This pipeline and similar strategies have the
potential to advance drug design and development efforts by not only prioritizing candidates but
also informing various considerations in the drug development pipeline, including competitive
binding estimation.


2.7. Modeling Path Importance for Effective Alzheimer’s Disease Drug Repurposing 


The final manuscript in this session by Xiang et al. presents a modeling schema focused on building 
a large-scale protein-protein interaction network from various data sources. Their approach 
incorporated both available data about protein-protein interactions and existing drug-target 
interactions to develop a rich data resource for prioritization of biological systems, e.g. networks 
and pathways. Their models capture a network’s rich topology and challenge the assumption that
paths of equal length have equivalent importance in biological systems. Results were further 
supported by the prioritization of several drug candidates that are supported by previous publications 
and insurance claims data. 


3. Conclusion 


The authors in this session present six diverse papers that discuss methodologic improvements to 
guide potential drug discovery and repurposing. The session expands upon the application of 
commonly used techniques like improving prediction of loss-of-function mutations for target 
identification as well as modeling strategies using genetic data to evaluate medication exposure and 
outcomes. There is also a special emphasis on using machine learning techniques and available 
datasets to identify drug targets by considering disease, protein, and kinase interactions. We 
anticipate that these studies, results, and associated techniques can advance disease-specific target 
evaluation and drug repurposing.


Drug repurposing (DR) intends to identify new uses for approved medications outside their
original indication. Computational methods for finding DR candidates usually rely on prior 
biological and chemical information on a specific drug or target but rarely utilize real-world 
observations. In this work, we propose a simple and effective systematic screening approach 
to measure medication impact on hospitalization risk based on large-scale observational 
data. We use common classification systems to group drugs and diseases into broader func- 
tional categories and test for non-zero effects in each drug-disease category pair. Treat- 
ment effects on the hospitalization risk of an individual disease are obtained by combining 
widely used methods for causal inference and time-to-event modelling. 6468 drug-disease 
pairs were tested using data from the UK Biobank, focusing on cardiovascular, metabolic, 
and respiratory diseases. We determined key parameters to reduce the number of spurious 
correlations and identified 7 statistically significant associations of reduced hospitalization 
risk after correcting for multiple testing. Some of these associations were already reported 
in other studies, including new potential applications for cardioselective beta-blockers and 
thiazides. We also found evidence for proton pump inhibitor side effects and multiple pos- 
sible associations for anti-diabetic drugs. Our work demonstrates the applicability of the 
present screening approach and the utility of real-world data for identifying potential DR 
candidates. 


Keywords: Drug repurposing; Propensity score matching; Cox regression; Real-world data 


1. Introduction 


Drug discovery is a rarely successful and extremely costly process that can span decades before
commercialization. Drug repurposing (DR), or re-utilizing an existing medication for another
use, has the potential to cut down the cost of development by a factor of 10.¹ DR is still
dependent on clinical trial success and only approximately 30% of repurposed drugs go from
phase I to market,² a process that can take multiple years. The majority of trials fail due to
insufficient efficacy or the existence of other superior alternatives. Computational methods can
reduce the chances of trial failure by selecting candidates that are likely to succeed and have
already resulted in the identification of approved medications and promising candidates.
A large number of computational DR approaches attempt to identify drug-disease associations
by utilizing molecule structure, common pathways, or other known biological properties.
Signature matching and molecular docking use structural and chemical properties of molecules
to identify similar drugs and therapeutic targets. Other approaches use genome-wide summary
statistics or biological pathway information to identify causal genes and new potential
targets. These methods generally attempt to use known information on the drug or disease
in question to infer new treatment options.

© 2023 The Authors. Open Access chapter published by World Scientific Publishing Company and
distributed under the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC)
4.0 License.

Alternatively, electronic health records (EHRs) have been used to identify potential alternative
treatment targets based on documentation of side effects and clinical events. Eguale et al.¹⁰
used EHR data and Cox regression to associate off-label drug use with adverse drug events.
Wu et al. have recently proposed another type of screening method using EHRs for the
identification of drug-disease interactions.¹¹ Similarly, UK Biobank data has also been used
to identify relations between treatment and phenotype, although these approaches generally
focus on a specific phenotype and treatment pair. For instance, Ma et al.¹² used Cox regression
in UK Biobank data to identify the benefits of glucosamine for type 2 diabetes. Pilling et
al.¹³ also used time-to-event modelling in UK Biobank to link lower vitamin D levels and
hospitalization for delirium. Wu et al.¹⁴ used propensity score matching in UK Biobank for
cost-benefit analysis of bariatric surgery.

Nevertheless, utilizing real-world data to isolate the effect of medication has proven
challenging, as this approach is highly prone to bias with the risk of creating spurious
associations.¹⁵ Indeed, in observational data, the characteristics of the treatment group are
often very different from the average clinical study population. Propensity score matching
(PSM)¹⁶,¹⁷ is a statistical matching technique that attempts to associate subjects of the
treatment group with similar subjects from the rest of the cohort to form a control group.
Matched subjects have similar characteristics (as measured by selected covariates), limiting
the impact of confounders in the estimation of the treatment effect. When time information of
events is available, PSM can be combined with survival methods such as Cox regression¹⁸ to
estimate the relative risk between the treatment and control arms.¹⁹
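
The matching step can be illustrated with a minimal sketch. It assumes propensity scores have already been estimated (for example with a logistic regression on the selected covariates) and uses greedy 1:1 nearest-neighbour matching without replacement and with a caliper. These specific choices, the function name, and the toy scores are assumptions made for illustration and are not necessarily those used in this work.

```python
def match_controls(treated, controls, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching on propensity scores.

    treated, controls: dicts mapping subject id -> propensity score.
    Returns a dict treated_id -> control_id. Treated subjects with no
    control within the caliper stay unmatched.
    """
    available = dict(controls)
    pairs = {}
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        # Closest remaining control in propensity score.
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs[t_id] = c_id
            del available[c_id]  # match without replacement
    return pairs


# Toy example with hypothetical subjects and scores.
treated = {"t1": 0.30, "t2": 0.62, "t3": 0.95}
controls = {"c1": 0.28, "c2": 0.33, "c3": 0.60, "c4": 0.10}
pairs = match_controls(treated, controls)
print(pairs)
```

The matched pairs then define the treatment and control arms on which a survival model such as Cox regression can be fitted; that step is not shown here.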

In this work, we propose to model the risk of hospitalization with respect to treatment for a large
number of combinations of drugs and diseases. We effectively attempt to emulate thousands 
of clinical trials with hospitalization risk reduction as the endpoint. Our methodology is akin 
to genome-wide association studies (GWAS), in which a simple model is used to estimate the 
effect of a large number of loci in a hypothesis-free manner. As in GWAS, this form of drug- 
disease association study faces the risk of creating spurious relationships and requires further 
analysis, but can be seen as complementary to target-driven repurposing.20 We applied our
method to thousands of drug-disease pairs and showed that we can successfully re-identify 
associations that are already reported in UK Biobank, other observational cohorts, or con- 
trolled clinical trials. To the best of our knowledge, this is the first attempt to apply this type 
of systematic approach for treatment effect modelling. 


2. Methods 


2.1. Dataset 


The UK Biobank21 (UKBB) is a large observational dataset containing information on approximately
500K subjects over decades. During their initial visit to the UK Biobank assessment
center, participants were interviewed about their medication use and completed a detailed 
questionnaire presenting questions on everyday habits, medical history, and mental health 
among others. A total of 1,233,630 treatments were reported, spanning 6745 different medi- 
cations. Other biomarkers such as body mass index (BMI), blood pressure, and grip strength 
were also measured. Moreover, since the beginning of the study, more than 6 million
hospitalization events have been recorded, each as an event date and a corresponding
International Classification of Diseases, 10th revision (ICD10) code.


[Figure 1 graphic: three panels. (1) Covariate Selection: age, sex, BMI, TDI, smoking, drinking,
common medications and medical history. (2) Subject Matching: identify similar subjects based on
selected variables, yielding a treatment group and a selected control group. (3) Event Risk
Modelling: estimate treatment effect on hospitalization odds for the CCSR; outputs are the
estimated treatment effect and the P-value of a non-zero effect. Repeat for each BNF-CCSR
drug-disease pair.]

Fig. 1. Overview of the proposed method. It is composed of three main steps and is repeated for
each BNF-CCSR drug-disease pair. (1) We use common demographics and risk factors as covariates.
Additionally, we include drugs and medical history, coded as BNF and CCSR respectively, if they
are present in more than 20% of the treatment group. We also remove subjects with a history of
the CCSR code in question. (2) We use propensity score matching to find a similar non-treated
subject for each subject of the treatment group, based on the selected covariates. (3) We use
Cox regression, a proportional hazards method, to estimate the treatment effect on
hospitalization for the corresponding CCSR.


2.2. Medication and Disease Selection 


ICD10 is a medical classification list from the World Health Organization used by many health
organizations around the world that contains codes for over 70K diseases and symptoms. Some
ICD10 codes represent similar phenotypes; for instance, I50.0, I50.1, and I50.9 correspond to
congestive, left ventricular, and unspecified heart failure respectively. Using individual ICD10
codes for our analysis would be challenging due to the small number of events for each code,
so we grouped them using Clinical Classifications Software Refined (CCSR)22 v2023.1. CCSR
is a classification system developed by the US Agency for Healthcare Research and Qual- 
ity’s Healthcare Cost and Utilization Project, that aggregates codes into clinically meaningful 
categories. We considered all CCSR categories spanning diseases of the circulatory system 
(CIR), endocrine, nutritional, and metabolic diseases (END) and symptoms, signs, and ab- 
normal clinical and laboratory findings (SYM), as well as some others from diseases of the 
respiratory system (RSP), genitourinary system (GEN) and nervous system (NVS), totaling 
77 phenotypes encompassing 3650 ICD10 codes. 

Most of the medication types recorded in the UK Biobank dataset have very low frequency. 
Additionally, it is common for equivalent or similar active compounds to have different names, 
and no hierarchy is provided. To organize this data in a meaningful way, we mapped each 
medication to a corresponding British National Formulary (BNF) code. This code structure 
is used by the UK’s National Health Service (NHS) to assign codes to drugs and chemicals 
and provides a fine-grained classification based on functionality. We used existing software23
to map 3500 UKBB medications to 151 BNF codes. We only considered codes with at least 
1000 subjects in the treatment group (power analysis with hazard ratio 0.6 and 80% power), 
resulting in 84 total BNF codes for analysis that include 93% of all reported medications in 
the UK Biobank during the first visit. 


2.3. Covariate and Subject Selection 


Medication-disease pairs (medications represented by BNF codes, diseases by CCSR codes)
were evaluated independently, and the covariate and subject selection process was repeated
for each pair. In total, we examined 6468 medication-disease pairs. We selected subjects from
all available 500K participants who did not have a history of the CCSR code in question. 
Covariates can have a large impact on the estimated treatment effect and should be chosen 
carefully. In an attempt to be as general as possible, we used common demographics and risk 
factors: sex, age, BMI, Townsend deprivation index (TDI)24 (related to poverty), smoking
(current), and drinking habits (three times a week or more) as common covariates for all
associations. For computational reasons, we capped the maximum number of subjects in the 
treatment group to 30,000, randomly sub-sampling when necessary. 

To produce more precise matching and allow for more potential confounders, we also 
added medical history and medications as covariates if they were present in more than 20% 
of the treatment group. This percentage was evaluated for each individual drug-disease pair. 
Medical history was composed of both self-reported items and ICD10s prior to the first visit, 
grouped by CCSR coding. UKBB self-reported disease codes have their own representation, 
which were mapped to ICD10 and then to CCSR. Other medications were also selected by 
their corresponding BNF codes. This method of covariate selection has the advantage of 
being agnostic to the type of medication being considered. We used the same covariates for 
propensity score matching and Cox regression in all experiments. 
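
As an illustration only, the 20% prevalence rule described above can be sketched in a few lines of Python. The input layout (a dictionary from subject identifier to a set of codes), the function name, and the toy codes are hypothetical and are not taken from the authors' implementation, which was written in R.

```python
from collections import Counter


def select_frequent_covariates(treated_codes, min_fraction=0.20):
    """Return the codes carried by more than `min_fraction` of treated subjects.

    treated_codes: dict mapping subject id -> set of BNF/CCSR codes
    (hypothetical layout, chosen for this sketch only).
    """
    n_treated = len(treated_codes)
    if n_treated == 0:
        return []
    # Count in how many treated subjects each code appears.
    counts = Counter(code for codes in treated_codes.values() for code in codes)
    # Strict inequality: "present in more than 20% of the treatment group".
    return sorted(code for code, n in counts.items() if n / n_treated > min_fraction)


# Toy treatment group of five subjects (codes are illustrative only).
toy_group = {
    "s1": {"CIR007", "BNF_2.2.1"},
    "s2": {"CIR007"},
    "s3": {"END010"},
    "s4": set(),
    "s5": {"CIR007", "END010"},
}
selected = select_frequent_covariates(toy_group)
print(selected)
```

In this toy group, a code held by exactly one of five subjects sits at the 20% boundary and is therefore not selected, which mirrors the strict "more than 20%" wording.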


2.4. Propensity Score Matching and Pair Exclusion 


PSM consists of finding similar subjects in the control and treatment groups. This is done by
fitting a logistic model of treatment assignment on the covariates and pairing subjects that
have a similar estimated probability of being in the treatment group (the propensity score).
The unmatched subjects from the control group are then discarded. We used nearest neighbor
distance as our matching method, and PSM was implemented in R using the MatchIt25 package.
PSM enables the estimation of the average effect of treatment in the treated individuals
(ATT). In contrast to the average treatment effect (ATE), the ATT represents the effect of
the drug on the treatment group, rather than on the population as a whole.
As most drugs would not have any beneficial effect on a healthy population, we expect the 
effect of drugs for subjects that are already likely to be on treatment to be a more informative 
measure. 
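
To make the matching step concrete, the following Python sketch performs greedy 1:1 nearest-neighbor matching on precomputed propensity scores. It is a simplified stand-in for what a matching package does, not the authors' code: the processing order, the absence of a caliper, and matching without replacement are assumptions of this sketch, and the scores are toy values.

```python
def greedy_nearest_neighbor_match(treated_scores, control_scores):
    """Pair each treated subject with the closest unused control.

    treated_scores / control_scores: dict subject id -> propensity score in [0, 1].
    Returns a dict treated id -> control id; each control is used at most once.
    """
    available = dict(control_scores)
    pairs = {}
    # Assumption: treated subjects are processed from highest to lowest score,
    # because high-score subjects usually have the fewest comparable controls.
    for t_id, t_score in sorted(treated_scores.items(), key=lambda kv: -kv[1]):
        if not available:
            break  # no controls left to match
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        pairs[t_id] = c_id
        del available[c_id]  # matching without replacement
    return pairs


treated = {"t1": 0.80, "t2": 0.30}
controls = {"c1": 0.10, "c2": 0.35, "c3": 0.75, "c4": 0.50}
matches = greedy_nearest_neighbor_match(treated, controls)
print(matches)
```

Controls that are never selected (here `c1` and `c4`) are simply dropped, which is why the resulting estimate describes the treated subjects (ATT) rather than the whole cohort.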

Despite still being widely used in retrospective studies, PSM has been criticized in the
past26 for potentially increasing imbalance between treated subjects and controls. However, this
imbalance increase is only observed when groups are balanced initially, which was not the 
case in our experiments. Some alternatives to PSM, such as inverse probability of treatment 
weighting (IPTW) and Mahalanobis distance matching (MDM) were not considered due to 
their computational cost. In practice, we found PSM to produce balanced groups with min- 
imal parameter tuning, and to be much more computationally efficient than other tested 
alternatives. 

Additionally, unknown variables can bias the estimation of the treatment effect, to the 
point that the opposite effect can become statistically significant. This issue is not exclusive 
to PSM, and we observed that choosing the appropriate covariates was generally more im- 
pactful than the matching method itself. In some cases, the assignment of the treatment can 
deterministically depend on other variables, resulting in a lack of observation in the control 
group. Since PSM can introduce spurious relations between treatment and controls, careful
interpretation of the treatment effects is always required. We report the number of balanced
and unbalanced covariates for each pair in the summary statistics, considering a covariate
balanced when its absolute standardized mean difference between groups is below 0.1.
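
The balance criterion can be illustrated with a short Python sketch. The pooled-standard-deviation form of the standardized mean difference used here is one common convention and is an assumption of this example; the paper does not spell out the exact formula, and the BMI values are invented.

```python
from math import copysign, sqrt
from statistics import mean, variance


def standardized_mean_difference(treated_values, control_values):
    """Difference in means divided by the pooled standard deviation."""
    diff = mean(treated_values) - mean(control_values)
    pooled_sd = sqrt((variance(treated_values) + variance(control_values)) / 2)
    if pooled_sd == 0:
        # Both groups are constant: balanced only if the constants coincide.
        return 0.0 if diff == 0 else copysign(float("inf"), diff)
    return diff / pooled_sd


def is_balanced(treated_values, control_values, threshold=0.1):
    """Apply the |SMD| < 0.1 rule to one covariate."""
    return abs(standardized_mean_difference(treated_values, control_values)) < threshold


# Invented BMI values for four matched pairs.
bmi_treated = [27.0, 31.0, 29.0, 33.0]
bmi_control = [26.0, 30.0, 28.0, 32.0]
smd = standardized_mean_difference(bmi_treated, bmi_control)
print(round(smd, 3), is_balanced(bmi_treated, bmi_control))
```

Here a one-unit shift in mean BMI against a standard deviation of about 2.6 gives an SMD near 0.39, so the covariate would be counted as unbalanced.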

We found that in some medication-disease pairs, some matched treated subjects and controls
were extremely dissimilar. Despite the large number of controls, some subjects in the
treatment group simply had no close counterpart. As an example, extremely morbidly obese
subjects are almost always on the same medications. To address this issue, we computed the
Huang distance27 between the two subjects of each matched pair and discarded pairs whose
distance exceeded a chosen threshold. The Huang distance was computed using both binary and normalized
continuous covariates, treating CCSR history, sex, alcohol, and smoking habits as binary 
variables and the rest as continuous. In practice, we found only marginal improvements when 
excluding large-distance pairs. 
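
A minimal sketch of the pair-exclusion step is given below. It assumes the mixed-type dissimilarity usually attributed to Huang (squared differences on continuous covariates plus a weighted count of mismatches on binary ones). The weight `gamma = 1`, the covariate names, and the values are assumptions made for this example and are not stated in the text.

```python
def huang_distance(a, b, continuous_keys, binary_keys, gamma=1.0):
    """Mixed-type distance between two subjects given as dicts of covariates.

    Continuous covariates are assumed to be normalized beforehand.
    """
    numeric_part = sum((a[k] - b[k]) ** 2 for k in continuous_keys)
    mismatch_part = sum(1 for k in binary_keys if a[k] != b[k])
    return numeric_part + gamma * mismatch_part


def keep_pair(a, b, continuous_keys, binary_keys, max_distance=3.0):
    """Keep a matched pair only if its distance does not exceed the cut-off."""
    return huang_distance(a, b, continuous_keys, binary_keys) <= max_distance


# Invented, already-normalized covariates for one matched pair.
treated_subject = {"age": 0.5, "bmi": 1.2, "sex": 1, "smoking": 0}
control_subject = {"age": 0.1, "bmi": 0.2, "sex": 1, "smoking": 1}
distance = huang_distance(treated_subject, control_subject,
                          continuous_keys=["age", "bmi"],
                          binary_keys=["sex", "smoking"])
print(round(distance, 2))
```

With a cut-off of 3 this pair would be kept, while a cut-off of 2 would discard it, which is the kind of sensitivity explored by varying the maximum distance.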


2.5. Cox Regression 


The Cox proportional hazards model18 is a semi-parametric regression technique that estimates
a relative hazard function, which represents a proportional risk of an event happening at time
$t$. The hazard function is of the form:

$$\lambda(t \mid X_i) = \lambda_0(t) \exp(X_i \cdot \beta) \qquad (1)$$

where $X_i$ represents the covariate vector for sample $i$, $\beta$ are tunable regression
coefficients, also referred to as effect sizes, and $\lambda_0$ is some common unknown baseline
hazard function that vanishes when estimating hazard ratios. Rather than treating the outcome
as a binary category, we considered a subject right-censored if no event had yet happened to
that subject.
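
To see why the baseline hazard drops out, it can help to write the hazard ratio explicitly. The split of the covariate vector into a binary treatment indicator $T_i$ and the remaining covariates $Z_i$, with coefficients $\beta_T$ and $\beta_Z$, is notation introduced here for illustration:

```latex
\begin{align}
\mathrm{HR}
  &= \frac{\lambda(t \mid T_i = 1, Z_i)}{\lambda(t \mid T_i = 0, Z_i)} \\
  &= \frac{\lambda_0(t)\,\exp(\beta_T + Z_i \cdot \beta_Z)}
          {\lambda_0(t)\,\exp(Z_i \cdot \beta_Z)} \\
  &= \exp(\beta_T)
\end{align}
```

Under this reading, a negative $\beta_T$ corresponds to a hazard ratio below 1; for example, $\beta_T = -0.56$ gives $\exp(-0.56) \approx 0.57$.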

We fitted a separate model for each drug-disease pair. We considered all hospitalizations 
resulting in an ICD10 code contained in the CCSR category of choice as an event and only 
considered the first event if a subject had multiple events with the same CCSR code. Following 
the advice of Peter Austin,'® we used a robust variance estimator and did not stratify on the 
matched sets. The output of the Cox regression is a treatment effect estimate $\beta_T$ and a
corresponding P-value for the null hypothesis of a zero effect for the drug-disease pair. The
Cox regression was implemented in R, using the survival28 package.

Using only the information from the assessment center, we could not consider how long
subjects had been on treatment, how long they would stay on it, or the medication
dosage. We also could not measure if subjects changed treatment over time. To attempt to
minimize the impact of some of these limitations, we only considered events that happened
within a given number of years after the assessment center visit and set this event time
window to 1, 3, 5, or 10 years. We also experimented with maximum pair Huang distances of
1, 2, 3, and no cut-off. Finally, we evaluated the impact of including common medical history 
and/or medications in the treatment group as covariates. A graphical overview of the method 
is presented in Figure 1. 
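
The event time window can be sketched as follows. This is a hypothetical helper, not the authors' code: it turns a visit date and an optional first hospitalization date into the (duration, event) pair that a survival model expects, censoring at the end of the window or at the end of follow-up, whichever comes first. The dates and the day-based time unit are assumptions of the example.

```python
from datetime import date


def time_to_event(visit, first_event, window_years, last_follow_up):
    """Return (duration_in_days, event_observed) for one subject."""
    window_days = round(window_years * 365.25)
    follow_up_days = (last_follow_up - visit).days
    horizon = min(window_days, follow_up_days)
    if first_event is not None:
        days = (first_event - visit).days
        if 0 <= days <= horizon:
            return days, True  # event falls inside the window
    # No event, or event outside the window: right-censor at the horizon.
    return horizon, False


visit = date(2008, 1, 1)
event = date(2009, 1, 1)
end_of_records = date(2020, 1, 1)
within_three_years = time_to_event(visit, event, 3, end_of_records)
within_one_year = time_to_event(visit, event, 1, end_of_records)
print(within_three_years, within_one_year)
```

The same hospitalization counts as an event under a 3-year window but is censored under a 1-year window, which is how the choice of window changes the number of usable events.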


3. Results 


We applied our method to 77 disease categories and 84 medication types, resulting in 6468 
potential drug-disease associations. Our results are reported in Figure 2. When comparing
negative and positive associations, we observed a clear bias towards unfavorable effects
($\beta_T > 0$, corresponding to increased risk of hospitalization and a hazard ratio greater
than 1) for all parameter configurations, although some configurations are less biased than
others. Since these medications have been thoroughly tested for safety and side effects, we
would expect this ratio to be more balanced. We attribute this imbalance in significant
associations to a failure to find appropriate matches in PSM, making the control group
systematically healthier than the treatment group.

We found that the bias towards unfavorable effects did not vanish when reducing the pair 
Huang distance cut-off, implying that this discrepancy is due to non-observable variables. 
When inspecting significant associations, we found that drugs that were already used as a 
treatment for a CCSR category were consistently associated with a higher hospitalization risk 
for the same CCSR. Our explanation for this observation is that treatment was prescribed 
to high-risk subjects without any hospitalization event or self-report, making the treatment 
group inherently more at risk than matched controls. 

Fig. 2. Volcano plots of effect estimates for each drug-disease pair, spanning 84 medication categories
and 77 phenotypes for multiple parameter choices. $\beta_T$: Cox regression coefficient; positive
values indicate unfavorable effects (increased hazard ratio for hospitalization). Non-significant
(p > 0.05) associations are reported in grey. Medications associated with a reduced or increased
risk of hospitalization for the corresponding disease are reported in blue and red, respectively.
A: Matching only with common covariates (sex, age, BMI, TDI, smoking, and drinking habits) with
an event time limit of 3 years. B: Matching with common covariates and medical history with an
event time limit of 3 years. C: Matching with common covariates and other medications with an
event time limit of 3 years. D: Matching with common covariates, medical history, and other
medications with a maximum event time limit of 1 year post-visit. E: As D, with a 3-year limit.
F: As D, with a 10-year limit.


As an example, anti-diabetic drugs (BNF 6.1.2) were consistently associated with a higher
risk of diabetes (CCSR END002). This is likely due to the fact that we could not match for
pre-diabetes effectively, and thus the treatment group was much more likely to end up diabetic
than the matched controls. Based on the previous results, we chose the combination of
covariates, time, and Huang distance cut-offs that would result in the most balanced number of
total associations (maximum Huang distance of 3, maximum time-to-event of 3 years, and
including both medication and medical history). We used this configuration for all further
analysis. Based on power analysis (hazard ratio of 0.6 and 80% power), we automatically
discarded drug-disease pairs that included fewer than 100 events. We were able to estimate
effects for 1013 pairs, and we used this number to correct for multiple testing.
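
The text does not state which formula underlies the power analysis. As a rough illustration, the sketch below uses Schoenfeld's approximation for the number of events needed to detect a given hazard ratio in a two-group Cox model. The choice of this formula, the 1:1 allocation, and the significance level of 0.05 are all assumptions of the example.

```python
from math import log
from statistics import NormalDist


def required_events(hazard_ratio, power=0.80, alpha=0.05,
                    treated_fraction=0.5, two_sided=True):
    """Schoenfeld's approximation of the number of events needed.

    d = (z_alpha + z_power)^2 / (p * (1 - p) * log(HR)^2)
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2) if two_sided else z.inv_cdf(1 - alpha)
    z_power = z.inv_cdf(power)
    p = treated_fraction
    return (z_alpha + z_power) ** 2 / (p * (1 - p) * log(hazard_ratio) ** 2)


events_two_sided = required_events(0.6)
events_one_sided = required_events(0.6, two_sided=False)
print(round(events_two_sided), round(events_one_sided))
```

Under these assumptions the requirement comes out at roughly 95 to 120 events depending on whether the test is one-sided or two-sided, which is the same order of magnitude as the 100-event cut-off.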

As we estimate the ATT, the correct interpretation of these measurements is that treated 
subjects would have a different risk of the corresponding CCSR code had they not taken the 
treatment, after correcting for all other known covariates. The root cause of this risk reduction 
cannot be inferred, and additional analysis is always required to determine the clinical rele- 
vance of this measured effect. Similarly, our method is also capable of measuring side effects 
that manifest as increased risks of hospitalization. We report all statistically significant effects 
after correcting for multiple testing (Bonferroni correction on the number of drug-disease pairs 
tested) in Table 1, ordered by statistical significance. We expand further on each pairing in
the next sub-sections. 
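
The multiple-testing step can be sketched in Python as below. The function names and the result layout are hypothetical. Apart from the cardioselective beta-blocker entry, whose coefficient and P-value are quoted in Section 3.1, the example entries are invented for illustration.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests


def significant_risk_lowering(results, n_tests, alpha=0.05):
    """Return pairs with a negative coefficient and a P-value below the threshold.

    results: dict (bnf_code, ccsr_code) -> (beta_T, p_value); hypothetical layout.
    The output is ordered by increasing P-value.
    """
    threshold = bonferroni_threshold(alpha, n_tests)
    hits = [(pair, beta, p) for pair, (beta, p) in results.items()
            if beta < 0 and p < threshold]
    return sorted(hits, key=lambda hit: hit[2])


threshold = bonferroni_threshold(0.05, 1013)
toy_results = {
    ("2.4.1", "RSP008"): (-0.62, 2.9e-12),  # values quoted in Section 3.1
    ("X.1", "TOY001"): (-0.10, 0.03),       # invented: nominally significant only
    ("X.2", "TOY002"): (0.45, 1e-20),       # invented: significant but risk-increasing
}
hits = significant_risk_lowering(toy_results, n_tests=1013)
print(f"{threshold:.2e}", [pair for pair, _, _ in hits])
```

With 1013 tested pairs the per-test threshold is about 4.9e-05, consistent with the cut-off of roughly 5e-05 used for the reported tables.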


Table 1. All statistically significant medications associated with a reduced risk of hospitalization for
the corresponding disease ($p < 5 \cdot 10^{-5}$, after Bonferroni correction for multiple comparisons),
ordered by P-value.


BNF     CCSR    Medication                     Disease                        Hazard Ratio  P-value
2.4.1   RSP008  Cardioselective Beta-blockers  COPD and bronchiectasis        0.54          2.9e-12
9.6.7   END010  Multivitamins                  Disorders of lipid metabolism  0.56          4.5e-12
10.1.5  END002  Glucosamine                    Diabetes mellitus              0.63          3e-10
2.2.1   CIR011  Thiazides                      Coronary atherosclerosis       0.69          7.9e-10
10.1.5  END010  Glucosamine                    Disorders of lipid metabolism  0.73          1.9e-07
2.2.1   CIR019  Thiazides                      Heart failure                  0.55          2.8e-07
9.6.7   CIR007  Multivitamins                  Essential hypertension         0.7           2.7e-05


3.1. Cardioselective Beta-blockers and COPD 


Beta-adrenergic blocking agents or beta-blockers (BNF 2.4) used for COPD (CCSR RSP008)
constitute one of our most significant positive drug-disease pairs ($\beta_T = -0.56$,
$p = 10^{-10}$) with a corresponding hazard ratio for hospitalization risk of 0.57. Historically,
the use of beta-blockers was discouraged for COPD as non-selective beta-blockers can reduce lung
function.29 Nevertheless, several retrospective observational studies have shown that usage of
beta-blockers can reduce mortality and exacerbations in COPD.30,31

The beta-blocker BNF encoding does not separate cardioselective from non-selective
compounds, so we split this category into two, 2.4.1 and 2.4.2, for cardioselective
and non-cardioselective beta-blockers respectively. We found a stronger effect and smaller
P-value ($\beta_T = -0.62$, $p = 2.9 \cdot 10^{-12}$) for category 2.4.1 than for 2.4 as a whole,
while results for non-selective beta-blockers were not significant; this corroborates recent
observational findings.32,33 Thus, our results agree with the consensus that cardioselective
beta-blockers are not only safe for patients at risk of COPD but could also reduce their risk
of hospitalization.34 The effect of cardioselective beta-blockers for patients with COPD is the
subject of an ongoing phase IV clinical trial (NCT03566667).35


3.2. Glucosamine and Diabetes Mellitus 


Glucosamine is a widely used supplement for osteoarthritis that is often taken daily and has 
anti-inflammatory properties. While glucosamine has been shown to induce insulin resistance
in rodents,36 this effect does not appear to be present in humans.37 Moreover, in line with
our findings, another recent UK Biobank study also showed the potential of glucosamine for
the prevention of diabetes.12 Since glucosamine does not impact blood sugar levels, glucose
tolerance, or insulin resistance, this effect is likely not direct. However, there is an established
relation between inflammation and the occurrence of diabetes,38 and even support for
inflammatory pathways being involved in its pathogenesis.39 The anti-inflammatory properties of
glucosamine and the reduction of symptoms of arthritis might explain its apparent benefits 
for diabetes. We also found a reduced risk of hospitalization for disorders of lipid metabolism, a 
CCSR category that includes different types of hypercholesterolemia and hyperlipidemia (cor- 
responding to ICD10 E78). Thus, long-term glucosamine supplementation might be beneficial 
for the prevention of diabetes and other metabolic diseases. 


3.3. Multi-vitamin Supplementation 


There is mixed evidence for the benefits of multi-vitamin (MVM) supplementation for general
health,40,41 with a general consensus from clinical trials that MVM supplementation does not
reduce cardiovascular disease (CVD) mortality. Recently, Che et al.42 found that
multivitamin/mineral supplementation was associated with a modest reduction in CVD events in
the UK Biobank. In contrast,
we find that MVM supplementation is associated with a substantial reduction in risks of 
disorders of lipid metabolism and essential hypertension. 

We suspect that the average MVM user in UK Biobank is more health-conscious than their
matched counterparts or had been taking MVM and other supplements for a long time before
their visit to the assessment center, thus biasing our estimates. Subjects were matched for their
history of hypertension, use of non-opioid analgesics, lipid-regulating drugs, and glucosamine 
in addition to the common covariates. Adding other confounding variables such as diet and 
exercise might reduce the estimated effect of MVM, although we leave this analysis for future 
research. 


3.4. Thiazides and Heart Failure 


We report that thiazides, a family of diuretics, are associated with a reduced risk of hospital- 
ization for coronary atherosclerosis and heart failure. This coincides with the results reported 
in previous studies such as the SPRINT clinical trial, which showed the importance of 
intense systolic blood pressure management for the risk reduction of cardiovascular events. 
Additionally, more than half of heart failure cases have a history of hypertension.44

When inspecting the treated subjects and matched controls, we found that only approximately
5% of the control group was on some form of non-thiazide diuretic. The proportion of subjects on
other blood pressure medications such as beta-blockers was otherwise similar. 98% of the treatment
group had a history of hypertension, while the ratio for the control group was 96%. It is possible 
that the observed reduction in hospitalization risk might generalize to other types of diuretics. 

Interestingly, we observed an opposite effect for loop diuretics (LD, BNF 2.2.2) and heart 
failure. As the number of subjects on thiazides was significantly larger than the number on LD, the
LD treatment group was matched with a large proportion of subjects on some other diuretic, 
which was not the case for thiazides. Furthermore, LD are also more likely to be already used 
for the management of heart failure, thus biasing our estimates. 

Recent studies support the use of thiazides for the treatment of heart failure. Using data 
from the SPRINT study, Tsujimoto et al. found that thiazides decreased the risk of heart
failure events in non-diabetics. In the CLOROTIC trial,46 the combination of thiazides
and LD proved to be effective for the treatment of acute heart failure. Unfortunately, only
approximately one hundred subjects used both thiazides and LD in our dataset, making the
estimation of the effect of the combination of both treatments infeasible. Nevertheless, our
results underline the importance of hypertension management for the prevention of heart 
failure and the potential of thiazide diuretics. 


3.5. Other Associations 


We found 92 statistically significant associations (after Bonferroni correction) for medications
associated with an increased risk of hospitalization ($\beta_T > 0$, $p < 5 \cdot 10^{-5}$). Four
medication types accounted for 52 of these associations, all of which are reported in Table 2.
As previously explained, some of these associations are known to be spurious. For instance,
aspirin (BNF 4.7.1) does not
cause an increased risk of hospitalization for hypertension (CIR007). However, since aspirin is 
commonly prescribed to individuals at risk of hypertension and other diseases it is associated 
with the phenotype in our analysis. We observe a similar effect for loop diuretics and multiple 
cardiovascular diseases. 

We also observed that proton pump inhibitors (PPIs, BNF 1.3.5) were associated with 
an increased risk for 23 diseases. We offer three potential explanations. 1) Subjects on PPIs 
have systematically poorer health than their matched counterparts, either due to unknown 
variables or scarcity of suitable matches in the control group. 2) PPIs are used in the treatment 
of multiple diseases in the list or other related comorbidities, thus biasing our estimates. 3) 
PPIs have measurable side effects and increase the risk of hospitalization for multiple diseases. 
Since PPIs are used for gastric acid-related disorders and have several known side effects,47,48
it is plausible for some of these associations to be causal. Further analysis would be required
to estimate the causal effect of PPIs on these diseases, either by Mendelian randomization or
a controlled study. We also come to a similar conclusion for anti-epileptic drugs (BNF 4.8.1), 
although the probability for these associations to be causal is lower. 


Table 2. Medications associated with a higher risk of hospitalization ($\beta_T > 0$, $p < 5 \cdot 10^{-5}$)
for the corresponding CCSRs, ordered by P-value from left to right; the P-value of the most significant
association is given in parentheses. CIR: Diseases of the circulatory system; SYM: Symptoms, signs and
abnormal clinical and laboratory findings, not elsewhere classified; RSP: Diseases of the respiratory
system; GEN: Diseases of the genitourinary system; END: Endocrine, nutritional and metabolic diseases


Drug Category           CCSRs
Proton Pump Inhibitors  CIR007 (p = 3.4e-36), SYM006, END010, CIR011, RSP008, SYM016, END009,
                        SYM012, CIR012, END007, GEN003, GEN002, END011, SYM001, CIR031,
                        GEN001, CIR026, SYM010, END002, SYM013, SYM014, SYM017, CIR016
Loop Diuretics          GEN003 (p = 4e-22), CIR019, GEN002, CIR016, END011, CIR003, SYM016,
                        GEN001, CIR015, CIR011, END002, CIR031
Control of Epilepsy     SYM016 (p = 1.8e-10), END011, SYM010, RSP008, CIR007, GEN003, SYM012,
                        SYM015, SYM001
Non-opioid Analgesics   CIR007 (p = 1e-26), END010, CIR011, RSP008, END002, CIR012, CIR026,
                        END009

We also found 58 other risk-lowering associations ($\beta_T < 0$, $p < 0.05$) that were not
statistically significant after correcting for multiple testing. Several of these associations were
also reported in the literature and could have potential clinical applications. Anti-diabetic 
drugs (BNF 6.1.2), mostly composed of metformin (86% of treated subjects) and blood sugar 
lowering medications, were associated with reduced risk of 7 disease categories including con- 
duction disorders (p = 0.0023), cardiac dysrhythmias (p = 0.034) and heart failure (p = 0.047). 
This corroborates the known cardiovascular benefits of metformin and other anti-diabetic 
drugs.49-51 


4. Discussion 


In this work, we proposed a purely phenotypic screening approach for drug repurposing that
consists of systematically measuring medication effects on hospitalization risk from
observational data. We showed that we could re-identify known repurposing candidates using simple,
extensively tested techniques for causal inference and time-to-event modelling. Grouping drugs
and diseases by functionality allowed us to gather enough events to estimate potential effects 
while keeping fine-grained categories. We estimated the risk of hospitalization, making our 
method inherently preventive although some results could generalize to already hospitalized 
patients. While our results mostly corroborate known associations, the data for this study has 
been available for ten years and this method can be applied to new cohorts and treatments. 

Due to the nature of the examined data, our study presents multiple limitations. The
generally low frequency of events for each CCSR code made the estimation of most effects 
impossible. While more events could have been included by increasing the time event limit, 
this would have also introduced more spurious associations. Without general provider
longitudinal data, we could not estimate the approximate dosage, length of treatment, or
whether subjects swapped treatments after the first visit; we found a maximum time from
visit of 3 years to be a good compromise. Medications were self-reported and no corresponding
indication was provided. While matching for common medication was found to produce less
imbalance in general, it can also be counterproductive in cases where a single drug is used
for multiple purposes and can result in inadequate matching. The quality of the matching
itself is difficult to quantify as most of the bias comes from unmeasured variables, or due to 
irreconcilable differences between control and treatment groups. 

Despite these limitations, biobanks have multiple advantages over typical EHR datasets. 
1) All measurements were taken with the same methodology by a small number of assessment 
centers. 2) Measures such as BMI were taken at a single time point, making time-to-event 
analysis straightforward. In contrast, EHRs typically have a large portion of missing variables 
and information is spread over multiple records. 3) Subjects directly described in detail their 
medication intake and medical history. These variables would be more challenging to recover 
with EHR data and would likely be incomplete, as the subject history must be stitched up from 
past events. UK Biobank data allowed us to perform time-to-event analysis with relatively little 
pre-processing, and scaling up to thousands of tests was also straightforward to implement. 
To the best of our knowledge, we are the first to report associations for cardioselective beta- 
blockers, thiazides, and proton pump inhibitors in the UK Biobank. 

Large-scale biobank data are a precious resource for understanding human health. While 
retrospective analysis is always biased and incomplete, it can be an effective tool to guide 
the design of future experiments that is complementary to other DR methods. Our proposed 
approach is especially effective at identifying repurposing candidates for preventive care of 
high-risk subjects. In the future, we plan on using longitudinal general provider prescription 
data to refine our estimates. 


5. Code and Data Availability 


Code used for the analysis and summary statistics for all drug-disease pairs in this manuscript
are provided in a dedicated GitLab repository: https://gitlab.com/CGeorgantasCHUV/SYESTE.


A single gene can produce multiple transcripts with distinct molecular functions. Rare-variant 
association tests often aggregate all coding variants across individual genes, without accounting for 
the variants’ presence or consequence in resulting transcript isoforms. To evaluate the utility of 
transcript-aware variant sets, rare predicted loss-of-function (pLOF) variants were aggregated for 
17,035 protein-coding genes using 55,558 distinct transcript-specific variant sets. These sets were 
tested for their association with 728 circulating proteins and 188 quantitative phenotypes across 
406,921 individuals in the UK Biobank. The transcript-specific approach resulted in larger estimated 
effects of pLOF variants decreasing serum cis-protein levels compared to the gene-based approach 
(Pbinom < 2x10°'°). Additionally, 251 quantitative trait associations were identified as being significant 
using the transcript-specific approach but not the gene-based approach, including PCSK5 transcript 
ENST00000376752 and standing height (transcript-specific statistic, P = 1.3x10"'°, effect = 0.7 SD 
decrease; gene-based statistic, P = 0.02, effect = 0.05 SD decrease) and LDLR transcript 
ENST00000252444 and apolipoprotein B (transcript-specific statistic, P = 5.7x10°, effect = 1.0 SD 
increase; gene-based statistic, P = 3.0x10“, effect = 0.2 SD increase). This approach demonstrates 
the importance of considering the effect of pLOFs on specific transcript isoforms when performing 
rare-variant association studies. 


Keywords: UK Biobank; rare variant; transcriptome; quantitative traits 


1. Introduction 


Alternative splicing allows for one gene to produce many transcript isoforms. When these isoforms 
differ in their coding sequence content, they can result in proteins with distinct molecular functions. 
Over 95% of protein-coding genes are alternatively spliced,1 which contributes to the large diversity 
of the human transcriptome and proteome. This process is instrumental in creating the complex and 
coordinated gene expression patterns that underlie all biological processes. 

Alterations to the transcriptome by genetic variation are instrumental in driving differences in
phenotypic expression. Many of these disruptions have been identified through genome-wide 
association studies (GWAS), which test the impact of common single nucleotide variants (SNVs) 
on phenotypes on a population scale. Studies such as these are critical in drug-discovery efforts, as 
they can be used to identify new therapeutic targets for disease. Additionally, lack of genetic 
validation of therapeutic hypotheses has been shown to reduce the likelihood of a successful clinical 
trial², suggesting that genetically validated targets are essential in the development process.

A large amount of phenotype heritability is not well captured through common-variant GWAS 
alone³. Rare, coding SNVs can be exceptionally disruptive to the transcriptome and have dramatic
effects on phenotypes, even more so than common variants⁴. Rare variant association studies
(RVAS) provide avenues to explain the “missing heritability” of traits®, and provide a 
complementary approach to common-variant GWAS. However, assessing genotype-phenotype
associations with these low-frequency SNVs is difficult because of insufficient sample size and
statistical power.

One approach to ameliorating the statistical challenges of rare variant analysis is the aggregation
of SNVs with similar predicted functional consequences, known as burden testing”*. For example, 
an analysis may collect rare protein-truncating variants (PTVs), also known as predicted
loss-of-function (pLOF) variants, that are expected to result in non-functional gene products through
nonsense-mediated decay’. These pLOF variants can then be aggregated and tested for their 
collective association with phenotypes of interest. This allows for an increase in statistical power 
and ability to detect genotype-phenotype associations which would otherwise be impossible at the 
level of single-variant tests. 

However, burden testing assumes that all aggregated variants will have a similar effect on the 
function of the gene and, consequently, the associated phenotype. This assumption does not hold if 
a gene has multiple transcript isoforms with diverse downstream functions. For example, a given
SNV may encode a missense variant that is deleterious in some encoded protein isoforms but not
others, or an SNV of any predicted consequence may overlap some transcript isoforms but not be
transcribed in others. The most common techniques for creating variant sets for
burden testing consider the most deleterious consequence of an SNV across all documented 
isoforms. Consequently, the expected impact of the SNV may be overestimated. Due to these
challenges, we propose the inclusion of transcript-aware analyses when studying rare variants, in 
addition to the standard gene-based approach. 

Our analysis uses whole-exome sequencing data from the UK Biobank to perform transcript- 
specific burden analyses on 406,921 individuals of European ancestry. Rare pLOFs were identified 
across 17,035 genes and aggregated by transcript, resulting in 55,558 unique, transcript-specific 
variant sets tested against the circulating levels of 728 cis-encoded proteins and 188 quantitative 
traits. The results of the transcript-specific burden tests were compared to the results from the 
maximally inclusive, standard, gene-based burden method. 


2. Data 


2.1. UK Biobank 


The UK Biobank consists of approximately 500,000 volunteer participants, who were aged 40-69
years when recruited between 2006 and 2010¹⁰,¹¹. Both array genotyping and whole-exome
sequencing have been performed on most of these participants¹². Data from genotyping, sequencing,
questionnaires, primary care data, hospitalization data, cancer registry data, and death registry data 
were obtained through application number 26041. Proteomic profiling was also performed on a 
subset of participants through application number 65851¹³. Ethical oversight for the UK Biobank is
provided by an Ethics and Governance Council which obtained informed consent from all 
participants to use these data for health-related research. Data management and analytics were 
performed using the REVEAL/SciDB translational analytics platform from Paradigm4. 


2.2. Variant calling and definition


The source of genetic data for the main analysis was exome sequencing data. DNA from whole 
blood was extracted and sequenced by the Regeneron Genetics Center (RGC) using protocols 
previously described'*. Of the variants called by RGC, additional quality-control filters were 
applied: Hardy-Weinberg equilibrium (among the European subpopulation, as defined by Pan- 
UKB!5) P > 1x10!°, and missingness across all individuals less than 2%. Variants were annotated 
using ENSEMBL Variant Effect Predictor (VEP)! version 109.3, using the LOFTEE plug-in 
version 1.0.4 to identify high-confidence predicted PTV variants? in protein-coding genes with 
minor allele frequency <1%. Bcftools was used to filter variants with genotype quality (GQ) > 20 
and depth (DP) > 7 or 10 for SNPs and indels respectively. For the gene-based burden, variant 
effects were scored against all available transcripts in ENSEMBL, and the most severe predicted 
impact was retained. Variants were aggregated in each protein-coding gene as follows: pLOF 
variants were defined as “HC” (high confidence) from LOFTEE and their most severe consequence 
from VEP as “stop gained,” “splice donor,” “splice acceptor,” or “frameshift.” For the transcript- 
based burden, the consequence for each variant was assessed individually by transcript. 
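
The two aggregation schemes described above can be sketched in Python as follows. This is a minimal illustration, assuming the VEP/LOFTEE annotations have already been flattened to one record per variant-transcript pair. The record keys (`variant`, `gene`, `transcript`, `consequence`, `loftee`) and the function name are hypothetical and are not taken from the authors' pipeline; the consequence names follow VEP's Sequence Ontology spellings.

```python
from collections import defaultdict

# Consequence terms treated as pLOF in the text (VEP Sequence Ontology spellings).
PLOF_TERMS = {
    "stop_gained",
    "splice_donor_variant",
    "splice_acceptor_variant",
    "frameshift_variant",
}


def build_variant_sets(annotations):
    """Return (gene_sets, transcript_sets) mapping labels to sets of pLOF variant IDs.

    `annotations` is an iterable of dicts with the hypothetical keys
    variant, gene, transcript, consequence, and loftee ("HC", "LC", or None).
    """
    gene_sets = defaultdict(set)
    transcript_sets = defaultdict(set)
    for row in annotations:
        is_plof = row["loftee"] == "HC" and row["consequence"] in PLOF_TERMS
        if not is_plof:
            continue
        # Transcript-based: the variant counts only for the transcript it disrupts.
        transcript_sets[row["transcript"]].add(row["variant"])
        # Gene-based: a pLOF call in any one transcript puts the variant in the gene set.
        gene_sets[row["gene"]].add(row["variant"])
    return dict(gene_sets), dict(transcript_sets)
```

In this sketch a variant that is a stop gain in one transcript and intronic in another enters the gene-level set and only the first transcript's set, which is the behavior the transcript-based burden relies on.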


2.3. Participant definition for overall analyses 


An initial round of quality control was performed by RGC, which removed subjects with evidence 
of contamination, discrepancies between chromosomal and reported sex, and high discordance 
between sequencing and genotyping array data. A European ancestry population was defined using 
data from the Pan-UKB Team¹⁵, resulting in a set of 406,921 European ancestry individuals with
exome sequencing data available. Two sets of genetic principal components (PCs) were defined, as 
described by Backman et al'’: a set derived from common array variants, of which 10 were used, 
and a set derived from rare exome variants, of which 20 were used. Rare exome derived PCs were 
calculated by applying the following filters on variants on the autosomes: MAF > 2.6x10° and < 
0.01, Hardy-Weinberg equilibrium P > 1x10`!?, and genotype missingness < 2%. Regions of high 
LD were removed, and SNPs were pruned with PLINK’s!® indep-pairwise function, using a 
window-size of 1,000 base pairs, a step size of 100 base pairs, and an R² threshold of 0.1. Indels
were removed, then R’s Smart PCA was implemented to derive the PCs. Array derived PCs for the 
European subset were derived by imposing a MAF filter > 0.01 and INFO score = 1 before running 
Smart PCA. 


2.4. Phenotype sources 


The main source of phenotype data was from a release of structured data by the UK Biobank Data 
Showcase on December 22, 2022. We tested 188 quantitative phenotypes, including physical 
measures, blood counts, metabolomics, touchscreen questionnaire responses on family history, 
telomere length, and urine biochemistry. Quantitative traits were rank-based inverse normal 
transformed to have a mean of zero and a standard deviation of one. 
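
A rank-based inverse normal transform can be sketched as below. The text does not state which rank offset was used, so the Blom offset (c = 3/8) here is an assumption, as are the function name and the choice to give tied values their average rank.

```python
from statistics import NormalDist


def rank_inverse_normal(values, c=3.0 / 8.0):
    """Map values to normal quantiles of their ranks (Blom offset by default)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # average 1-based rank across the ties
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - c) / (n - 2.0 * c + 1.0)) for r in ranks]
```

The output preserves the ordering of the input and is symmetric around zero, so the transformed trait has a mean of zero and a standard deviation close to one.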


2.5. Tissue-expressed transcript isoforms


GTEx version 8 bulk RNAseq data was aggregated across 54 tissue types from 948 donors. For each 
gene, expression was calculated across all tissues, identifying 145,219 transcripts with mean TPM 
expression > 0. 


2.6. Olink proteomics 


Characterization of 1,463 proteins across 54,306 individuals was undertaken by the UK Biobank 
Proteomics Project (UKB-PPP). Proteomic profiling was conducted across four panels utilizing the 
Olink Explore Assay. Sample collection, preparation, data pre-processing, and quality control are
described in detail in Sun et al’. Quantified protein expression levels were rank-based inverse 
normal transformed to have a mean of zero and standard deviation of one. 


3. Methods 


3.1. Transcript-specific variant set curation 


Rare, predicted loss-of-function (pLOF) variant sets (MAF < 1%) were created across 145,219
transcripts with mean TPM > 0 across all 53 GTEx tissue types. Overall, 72,769 transcripts had at 
least one overlapping rare pLOF variant. Identical variant sets that were representative of more than 
one transcript were combined into a single label, resulting in 55,558 unique transcript-specific 
variant sets across 17,035 genes. 
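
The collapsing of identical variant sets into a single label can be sketched as follows. The function name is hypothetical, and joining transcript IDs with a semicolon is an assumption modeled on how the tables list transcripts that share a variant set.

```python
from collections import defaultdict


def collapse_identical_sets(transcript_sets):
    """Merge transcripts whose variant sets are identical under one joined label.

    `transcript_sets` maps a transcript ID to a set of variant IDs.
    Transcripts with an empty set (no overlapping rare pLOF) are dropped.
    """
    by_content = defaultdict(list)
    for transcript, variants in transcript_sets.items():
        if variants:
            by_content[frozenset(variants)].append(transcript)
    return {";".join(sorted(ids)): set(vs) for vs, ids in by_content.items()}
```

Each distinct set is then tested once, which avoids repeating the same burden test under several transcript names.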


3.2. Whole-genome ridge regression analysis 


REGENIE v3.1.1!° was used to perform a whole-genome ridge regression taking subject relatedness 
into account, while using a Firth approximation to estimate P values. For all quantitative traits, 
REGENIE was run using an additive model across the entire European-ancestry population,
including related individuals, controlling for age, sex, age², age × sex, age² × sex, 10 rare-variant
derived principal components, and 20 common-variant derived principal components. For the Olink 
proteomics, batch numbers 1-7 were added as one-hot encoded covariates. 
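
The non-genetic covariates listed above can be assembled per participant as in the sketch below. It is an illustration only: the 0/1 coding of sex, the column names, and the function name are assumptions, and the genetic principal components are omitted.

```python
def covariates(age, sex, batch=None, n_batches=7):
    """Return one covariate row as a dict.

    sex is coded 0/1 (an assumed coding); batch, if given, is 1..n_batches
    and is expanded into one-hot indicator columns.
    """
    row = {
        "age": age,
        "sex": sex,
        "age2": age ** 2,
        "age_x_sex": age * sex,
        "age2_x_sex": age ** 2 * sex,
    }
    if batch is not None:
        if not 1 <= batch <= n_batches:
            raise ValueError(f"batch must be in 1..{n_batches}")
        for b in range(1, n_batches + 1):
            row[f"batch_{b}"] = 1 if b == batch else 0
    return row
```

For example, `covariates(50, 1, batch=3)` gives `age2 = 2500`, `age2_x_sex = 2500`, and a single nonzero batch indicator, `batch_3`.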


3.3. Comparison of estimated effect sizes by approximating a binomial distribution 


The effect sizes from the transcript-based and gene-based burden tests were compared only in cases
where there was a significant association for a quantitative phenotype in both methods. R’s
binom.test() was used to test whether the proportion of results with stronger associations in the
transcript-based model differs from the proportion expected under the null hypothesis.
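
An exact two-sided binomial test of this kind can be sketched in Python as below. The null proportion of 0.5 (transcript-based and gene-based estimates equally likely to be the more extreme) is an assumption, since the text does not state it, and the function names are hypothetical. The two-sided rule, summing the probabilities of all outcomes no more likely than the observed one, is the convention used by R's binom.test().

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability mass, computed in log space for numerical stability."""
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log1p(-p)
    )
    return math.exp(log_pmf)


def binom_test_two_sided(k, n, p=0.5):
    """Exact two-sided binomial test p-value for k successes in n trials (0 < p < 1)."""
    observed = binom_pmf(k, n, p)
    threshold = observed * (1 + 1e-7)  # tolerance for floating-point ties
    total = sum(
        pmf for pmf in (binom_pmf(i, n, p) for i in range(n + 1)) if pmf <= threshold
    )
    return min(1.0, total)
```

With the counts reported in the Results, `binom_test_two_sided(437, 580)` and `binom_test_two_sided(745, 1010)` both fall far below 2x10⁻¹⁶.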


3.4. Binary case-control phenotype regression 


As a follow-up to the quantitative traits analysis, we tested a single binary phenotype, Alzheimer’s 
disease, across multiple TREM2 transcript-specific variant sets. Diagnoses were extracted from 
inpatient hospital diagnoses, the cancer and death registries, primary care, and self-reported data. 
We adjusted for age, sex, age², age × sex, age² × sex, 10 rare-variant derived principal components,
20 common-variant derived principal components, availability of primary care, and country of 
recruitment. 


4. Results


4.1. Transcript-specific variant sets show stronger associations with lower serum cis-protein 
levels 


To evaluate the validity of transcript-specific pLOF variant sets, they were first tested for their 
association with cis-encoded proteins. Variant sets with at least 10 carriers were tested across 728 
circulating serum proteins in 47,297 individuals of European ancestry and compared to the gene- 
based approach. Several gene and transcript-specific variant sets were identical, and their removal 
resulted in 913 unique transcript variant sets tested across 432 serum protein levels. Among 580 
results that were significant for both the transcript and gene-based burden approach, 75% (N = 437) 
had lower effect estimates on cis-serum proteins in the transcript-based burden (Figure 1), which is 
substantially greater than expected by chance (Pbinom < 2x10⁻¹⁶). Of the 437 transcript-based results
with lower cis-protein effect estimates, 45 had non-overlapping 95% confidence intervals with the 
effect estimates of the gene-based approach. 


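[Figure 1: scatter plot of estimated effect sizes. x-axis: effect size, gene-based burden; y-axis: effect size, transcript-based burden. Points are colored by whether the estimated effect size is lower in the gene-based burden or in the transcript-based burden.]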


Figure 1. Comparison of estimated effect sizes on circulating serum proteins. Each dot 
represents an association of the transcript or gene-based burden with a cis-encoded protein. 


4.2. Some pLOF-cis protein associations are only detectable using transcript-specific variant 
sets 


The transcript-based burden on cis-proteins resulted in 35 associations across 21 loci that were non- 
significant in the gene-based burden. Of these associations only significant in the transcript-based 
burden, 22 associations across 12 loci had non-overlapping 95% confidence intervals with the gene- 
based approach (Table 1), and all of them had lower estimated effect sizes. 


Gene/cis-protein | P value (gene-based) | Effect size (gene-based) | N carriers (gene-based) | 95% CI (gene-based) | P value (transcript-based) | Effect size (transcript-based) | N carriers (transcript-based) | 95% CI (transcript-based) | Transcripts
CD300LF | 5.3x10'! | -0.03 | 195 | -0.1,0.1 | 3.6x10 | -1.0 | 38 | -1.2,-0.8 | ENST00000326165; ENST00000464910; ENST00000583937
CD84 | 4.12x105 | -0.2 | 51 | -0.3,-0.1 | 1.3x107° | -1.0 | 10 | -1.2,-0.8 | ENST00000368048; ENST00000368051; ENST00000368054
CLEC10A | 3.0x10⁻⁴ | -0.1 | 103 | -0.2,-0.1 | 7.1x10⁻¹⁸ | -1.0 | 13 | -1.2,-0.7 | ENST00000416562; ENST00000571664
CPPED1 | 4.6x107 | -0.2 | 43 | -0.5,0.01 | 3.2x10°3 | -1.6 | 13 | -2.0,-1.1 | ENST00000381774
MSR1 | 1.4x10° | -0.1 | 110 | -0.2,-0.02 | 5.0x108 | -0.8 | 23 | -1.0,-0.6 | ENST00000262101
MSRA | 5.9x10° | -0.2 | 93 | -0.4,-0.1 | 7.3x10% | -1.3 | 13 | -1.7,-0.8 | ENST00000528246
NRP2 | 1.8x10"! | 0.1 | 32 | -0.04,0.2 | 2.6x10°\4 | -0.7 | 13 | -0.9,-0.5 | ENST00000357785
PLXNB2 | 1.5x10⁻² | -0.1 | 85 | -0.1,-0.01 | 4.9x10⁻¹⁰ | -0.4 | 23 | -0.5,-0.3 | ENST00000359337; ENST00000449103
SETMAR | 5.8x107 | -0.1 | 133 | -0.1,0.02 | 1.2x10 | -0.3 | 51 | -0.4,-0.2 | ENST00000425863
TREM2 | 8.8x10° | -0.2 | 66 | -0.3,-0.04 | 2.2x1076 | -1.2 | 17 | -1.5,-0.9 | ENST00000373113
TXNRD1 | 5.4x10° | -0.4 | 35 | -0.6,-0.2 | 1.3x10° | -1.0 | 12 | -1.3,-0.7 | ENST00000503506; ENST00000524698; ENST00000526390; ENST00000526950; ENST00000529546
TYMP | 2.6x107 | -0.2 | 54 | -0.4,-0.02 | 2.6x10° | -0.9 | 13 | -1.3,-0.5 | ENST00000425169


Table 1. Transcript-specific results with significant association on circulating cis-proteins, and transcript- 
based burden 95% CI not-overlapping with gene-based burden 95% CI. Multiple transcripts listed when 
variant sets are identical. 
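
The non-overlap criterion used for Table 1 can be expressed as a small check. This is a sketch with a hypothetical function name; intervals that merely touch at an endpoint are treated as overlapping, which is an assumption.

```python
def intervals_overlap(ci_a, ci_b):
    """Return True if two confidence intervals (low, high) share any point."""
    lo_a, hi_a = sorted(ci_a)
    lo_b, hi_b = sorted(ci_b)
    return lo_a <= hi_b and lo_b <= hi_a


# TREM2 row of Table 1: gene-based CI vs transcript-based CI.
trem2_overlap = intervals_overlap((-0.3, -0.04), (-1.5, -0.9))
```

For the TREM2 row the check returns False, so the transcript-based estimate is counted as distinct from the gene-based one.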


From these data, we focused on TREM2 as it has a known role in Alzheimer’s disease (AD) risk. 
TREM2 is primarily expressed in microglia, and rare loss-of-function mutations including the 
missense variant R47H have been shown to increase AD risk”’. When testing TREM2 transcript- 
specific pLOF variant sets (Figure 2), we observe more significant associations with larger 
reductions in serum TREM2 levels in the ENST00000338469 and ENST00000373113 models, 
compared to ENST00000373122 or the gene-based method (Table 2). 

The primary variant that explains the difference in signal is rs538447052, a splice acceptor 
variant at the boundary of exon 4. The canonical transcript with the highest brain expression”!, 
ENST00000373113, and ENST00000338469, are both unaffected by rs538447052 as it functions 
there as an intron variant. By excluding rs538447052 from these variant sets, we see a much stronger 
association with decreasing serum TREM2. 

Next, we tested the relationship between Alzheimer’s disease and TREM2 and its transcript 
isoforms. Our analysis is limited by a low number of affected carriers; however, we detect an 
enrichment of AD cases when using the more stringent TREM2 transcript models, 
ENST00000373113 and ENST00000338469 (Table 2). This association is absent in the 
ENST00000373122 and gene-based models and is consistent with the weaker observed effects on 
serum TREM2. 


[Figure 2: exon diagrams of TREM2 transcript models ENST00000373122, ENST00000373113, and ENST00000338469 and the inclusive model, with rs538447052 marked.]

Figure 2. TREM2 transcript models and the gene-based, inclusive model. 


Transcripts | P value (AD) | Odds ratio (AD) | 95% CI (AD) | N carriers (AD) | N carriers with AD | P value (serum TREM2) | Effect size (serum TREM2) | 95% CI (serum TREM2) | N carriers (serum TREM2)
ENST00000338469 | 8.9x10° | 10.3 | 2.6,40.9 | 48 | 2 | 7.2x10°° | -1.4 | -1.0,-1.7 | 12
ENST00000373113 | 1.2x107 | 9.1 | 2.3,35.9 | 55 | 2 | 2.1x107° | -1.2 | -1.4,-0.9 | 17
Inclusive model | 9.6x107 | 1.0 | 0.3,3.1 | 435 | 3 | 8.7x107 | -0.2 | -0.3,0.0 | 66
ENST00000373122 | 9.8x107 | 1.0 | 0.3,3.1 | 428 | 3 | 4.9x107 | -0.2 | -0.3,0.0 | 61


Table 2. TREM2 transcript-specific associations with AD and circulating TREM2 levels. 


4.3. Some pLOF-cis protein associations have opposite directions of effect in the transcript 
and gene-based models 


Most effect size estimates maintain their direction of effect when comparing the gene and 
transcript-based methods. However, six associations from three loci resulted in opposing estimated 
effect sizes (Table 3). In all six cases, the transcript-variant set of pLOFs associates with lower 
serum cis-protein levels, as expected, while the gene-based method associates with higher serum 
cis-protein levels. 


Gene/cis-protein | P value (gene-based) | Effect size (gene-based) | 95% CI (gene-based) | N carriers (gene-based) | P value (transcript-based) | Effect size (transcript-based) | 95% CI (transcript-based) | N carriers (transcript-based) | Transcripts
BST1 | 8.8x10⁻⁶⁹ | 0.4 | 0.4,0.5 | 543 | 6.6x10⁻³⁷ | -1.1 | -1.3,-1.0 | 37 | ENST00000265016; ENST00000382346
BST1 | 8.8x10⁻⁶⁹ | 0.4 | 0.4,0.5 | 543 | 1.1x10°° | -1.2 | -1.4,-1.0 | 34 | ENST00000505785
BST1 | 8.8x10⁻⁶⁹ | 0.4 | 0.4,0.5 | 543 | 2.7x10 | -1.4 | -1.6,-1.1 | 15 | ENST00000514445
GPNMB | 2.9x107! | 0.2 | 0.1,0.3 | 446 | 1.1x107!? | -1.1 | -1.1,-1.0 | 93 | ENST00000409458
HMOX2 | 2.2x10⁻²¹ | 1.0 | 0.8,1.2 | 26 | 5.2x10⁻⁸ | -0.9 | -1.2,-0.6 | 11 | ENST00000570445; ENST00000575051
HMOX2 | 2.2x10⁻²¹ | 1.0 | 0.8,1.2 | 26 | 1.4x107 | -0.9 | -1.3,-0.6 | 10 | ENST00000574466; ENST00000575129; ENST00000576827


Table 3. Significant transcript-based burden results with opposing effect sizes compared to the gene- 
based burden. Multiple transcripts are listed when the variant sets are identical. 


The differences in variants captured by the BST1, GPNMB, and HMOX2 gene-based and
transcript-based variant sets are primarily attributable to variants missing from the terminal exon
(Figure 3). The most significantly associated transcript variant set for each locus mainly excludes a
single, frequent variant from the last exon: rs144539516, rs11537976, and rs146410700, respectively.


[Figure 3: exon diagrams (5’ to 3’) for BST1 (ENST00000265016), GPNMB (ENST00000409458, with rs11537976 marked), and HMOX2 (ENST00000570445, with rs146410700 marked), each shown with its inclusive model.]

Figure 3: BST1, GPNMB, HMOX2 inclusive gene-based model and representative transcript-based
models.


Each of these excluded variants strongly associates with increased cis-serum protein levels when
tested individually (Table 4). Rs144539516 and rs146410700 are 3’ UTR variants in at least one
transcript, which may affect the post-transcriptional stability of the RNA product. Rs146410700
is also annotated as a missense variant in some transcripts and could influence protein
stability, detectability, and post-translational regulation. Rs11537976 acts as a non-coding exon
variant and may affect transcriptional regulation. In all instances, this provides an explanation for
the unexpected gene-level association with increased protein.


Cis-protein | Rsid | P value | Effect size | 95% CI | N carriers
GPNMB | rs11537976 | 1.5x107%9 | 0.6 | 0.5, 0.6 | 318
BST1 | rs144539516 | 7.6x 1071 | 0.5 | 0.5, 0.6 | 506
HMOX2 | rs146410700 | 1.2x10°! | 3.4 | 3.0, 3.7 | 11


Table 4: Single variant cis-protein association results for the BST1, GPNMB, and HMOX2 variants
rs144539516, rs11537976, and rs146410700


4.4. Transcript-specific variant sets show stronger associations with quantitative traits 


Since the transcript-based variant sets show larger effects on circulating cis-proteins compared to 
the gene-based method, we next extended the analysis to quantitative traits. Transcript-specific 
pLOF variant sets with at least 10 carriers were tested for their association with 318 quantitative 
traits in 406,921 individuals of European ancestry and compared to the gene-based approach. After 
removing identical results between the transcript and gene-based approach, 6,981,491 transcript- 
trait and 2,740,011 gene-trait association tests were performed (Bonferroni corrected P value <
5.1x10⁻⁹). Among 1,010 associations that were significant in both the transcript and gene-based
approach, 73% (N = 745) had more extreme effect sizes in the transcript-specific approach (Figure
4), which is substantially larger than expected by chance (Pbinom < 2x10⁻¹⁶). Of these, 75 had
non-overlapping 95% confidence intervals with the gene-based approach. Additionally, 46% of
associations significant in both methods were more significant in the transcript-approach despite 
having a lower number of tested carriers in practically all instances. 
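
The Bonferroni threshold follows from the two test counts reported above. The sketch below assumes a family-wise alpha of 0.05, which the text does not state explicitly; the function name is hypothetical.

```python
def bonferroni_threshold(alpha, *test_counts):
    """Per-test significance threshold for a family-wise error rate alpha."""
    return alpha / sum(test_counts)


# Transcript-trait plus gene-trait tests, as reported in the text.
threshold = bonferroni_threshold(0.05, 6_981_491, 2_740_011)
print(f"{threshold:.1e}")  # prints 5.1e-09
```

Under that assumption, 0.05 divided by 9,721,502 tests gives a threshold of about 5.1x10⁻⁹.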


Figure 4. Comparison of estimated effect sizes on 188 quantitative traits for the transcript and
gene-based burden. Each dot represents an association of the transcript or gene-based burden
with a quantitative trait.


4.5. Transcript-specific variant sets elucidate novel transcript-trait associations 


We identified 241 associations across 60 loci as being significant in the transcript-based approach 
but not in the gene-based burden. Of these, 56 transcript-trait associations had effect estimates with 
non-overlapping 95% confidence intervals with the gene-based burden (Table 5). These include 
PCSK5 transcript ENST00000376752 and standing height (transcript-specific statistic, P = 1.3x10⁻¹⁶,
effect = 0.72 SD decrease; gene-based statistic, P = 0.02, effect = 0.05 SD decrease) and LDLR
transcript ENST00000252444 and apolipoprotein B (transcript-specific statistic, P = 5.7x10⁻²⁰,
effect = 1.0 SD increase; gene-based statistic, P = 3.0x10⁻⁴, effect = 0.2 SD increase). These data
reflect genotype-phenotype associations that would have been otherwise undetected if testing only 
the standard, gene-based burden. 


Gene | Phenotype | P value (gene-based) | Effect size (gene-based) | N carriers (gene-based) | 95% CI (gene-based) | P value (transcript-based) | Effect size (transcript-based) | N carriers (transcript-based) | 95% CI (transcript-based) | Transcripts
EPB41 | [illegible] percentage | 8.5x10° | 0.33 | 282 | 0.2,0.4 | 6.9x10% | 1.1 | 77 | 0.8,1.3 | ENST00000373800
LDLR | Apolipoprotein B | 3.0x10⁻⁴ | 0.17 | 425 | 0.1,0.3 | 5.8x10⁻²⁰ | 1.0 | 78 | 0.8,1.2 | ENST00000252444
SCUBE3 | Standing height | 1.2x10*4 | -0.12 | 373 | -0.2,-0.1 | 1.4x10!8 | -0.6 | 71 | -0.8,-0.5 | ENST00000274938
EPB41 | Total bilirubin | 1.5x10*4 | 0.19 | 275 | 0.1,0.3 | 8.3x10"7 | 0.8 | 74 | 0.6,1.0 | ENST00000373800
PCSK5 | Standing height | 1.9x10⁻² | -0.05 | 829 | -0.1,-0.01 | 1.3x10⁻¹⁶ | -0.7 | 50 | -0.9,-0.5 | ENST00000376752
UGT1A9 | Total bilirubin | 6.9x103 | 0.17 | 227 | 0.04,0.3 | 8.7x10!6 | 0.4 | 355 | 0.3,0.5 | ENST00000354728
TINF2 | [illegible] | 1.3x105 | 0.44 | 92 | 0.2,0.6 | 11x105 | 2.1 | 14 | 1.5,2.6 | ENST00000557921
PFKM | HbA1c | 4.8x10° | -0.25 | 391 | -0.3,-0.2 | 1.4x10°5 | -0.5 | 201 | -0.6,-0.4 | ENST00000549941
TTN | [illegible] | g.6x10° | -0.08 | 4430 | -0.1,-0.1 | 18x10 | -0.2 | 1807 | -0.2,-0.1 | ENST00000359218
CHD2 | [illegible] | 1.4x105 | 0.50 | 67 | 0.3,0.7 | L8x10'* | 2.1 | 12 | 1.5,2.6 | ENST00000394196
NF1 | [illegible] | 4.4x108 | -0.29 | 312 | -0.4,-0.19 | 3.5x10 | -1.4 | 25 | -1.8,-1.0 | ENST00000431387
IGF2BP2 | Standing height | 2.5x10⁻⁸ | -0.47 | 53 | -0.6,-0.3 | 4.0x10⁻¹³ | -0.9 | 25 | -1.1,-0.7 | ENST00000346192; ENST00000382199
PKD1 | Urate | 3.3x10° | 0.21 | 313 | 0.1,0.3 | 4.5x105 | 0.6 | 97 | 0.4,0.8 | ENST00000423118
CREB3L3 | Apolipoprotein [illegible] | 6.5x10⁻⁴ | -0.07 | 709 | -0.1,-0.03 | 2.2x10⁻¹² | -0.4 | 205 | -0.6,-0.3 | ENST00000078445; ENST00000595923
CLEC11A | Standing height | 3.7x10° | -0.02 | 3452 | [illegible] | 3.5x10? | -0.1 | 5719 | -0.1,-0.04 | ENST00000250340
TPM4 | [illegible] | 9.1x106 | 0.62 | 42 | 0.3,0.9 | 8.4x102 | 1.8 | 12 | 1.3,2.3 | ENST00000586833
COL18A1 | Apolipoprotein [illegible] | 6.6x10* | -0.07 | 876 | -0.1,-0.03 | 11x10" | -0.2 | 620 | -0.3,-0.2 | ENST00000355480
PFKM | Pyruvate | 8.7x10° | 0.55 | 05 | 0.4,0.7 | 14x10" | 1.2 | 30 | 0.9,1.6 | ENST00000546465
RNF10 | [illegible] | 1.7x107 | 0.04 | 296 | -0.02,0.1 | 3.2x10 | 0.7 | 76 | 0.5,0.9 | ENST00000413266
ANK1 | HbA1c | 6.4x10° | -0.23 | 18 | -0.4,-0.1 | 83x10" | -0.9 | 47 | -1.1,-0.6 | ENST00000520299
NF1 | Standing height | 1.0x106 | -0.17 | 322 | -0.2,-0.1 | 9.6x10 | -0.8 | 27 | -1.0,-0.5 | ENST00000431387
MARCHF8 | Erythrocyte [illegible] | 1.1x10 | 0.17 | 456 | 0.1,0.3 | 9.7x10 | 0.5 | 165 | 0.3,0.6 | ENST00000319836; ENST00000395769
PRC1 | Platelet crit | 1.7x105 | -0.22 | 304 | -0.3,-0.1 | 18x10 | -0.5 | 130 | -0.7,-0.4 | ENST00000442656
LARP1 | Mean corpuscular hemoglobin | 1.7x10? | -0.23 | 87 | -0.4,-0.04 | 8.9x1010 | -1.0 | 30 | -1.3,-0.7 | ENST00000518297
MARCHF8 | Immature [illegible] | 7.7x10⁻⁶ | 0.21 | 441 | 0.1,0.3 | 1.0x10⁻⁹ | 0.5 | 157 | 0.3,0.6 | ENST00000319836; ENST00000395769
UGT1A8 | Total bilirubin | 4.8x107 | -0.18 | 99 | -0.4,0 | 14x10% | 0.4 | 227 | 0.3,0.5 | ENST00000373450
CREB3L3 | Triglycerides | 2.2x10 | 0.06 | 1880 | 0.02,0.1 | 1.7x10 | 0.4 | 223 | 0.3,0.5 | ENST00000078445; ENST00000595923
PTCH1 | Standing height | 5.2x104 | -0.16 | 167 | -0.3,-0.1 | 2.7x10° | -0.4 | 87 | -0.5,-0.3 | ENST00000468211


Table 5. Transcript-specific results with significant quantitative trait associations, and 95% CI of effect size
not-overlapping with the gene-based burden 95% CI. For loci with multiple significant results, or multiple
highly correlated phenotypes, the result with the lowest P value is shown. Multiple transcripts are listed when
the variant sets are identical.


4.6. Transcript-specific variant sets limit pLOF variants in low expression exonic regions 


One method by which the transcript-aware variant sets improve burden testing is by excluding
variants within weakly expressed exonic regions. An example of this improvement can be shown
with LDL cholesterol and the low-density lipoprotein receptor (LDLR). We evaluated seven distinct
transcript-isoform variant sets for their association with apolipoprotein B, the main protein found
in LDL. All seven tested LDLR transcript sets were more statistically significant and had larger
effect sizes as compared to the gene-based inclusive method (Table 6).


The best performing LDLR transcript, ENST00000252444, compared to the worst performing 
LDLR transcript, ENST00000557933, and the gene-based model, lacks pLOF variants primarily in 
two critical regions: the first exon and part of the penultimate exon, highlighted in pink (Figure 5). 


[Figure 5: exon diagrams (5’ to 3’) of LDLR transcripts ENST00000252444 and ENST00000557933 and the inclusive model, shown with a pext track.]


Figure 5. Two LDLR transcript models and the inclusive, gene-based model overlaid with pext = 0
regions in pink. No pLOF variants appear in the terminal exons of all three models. 


In both regions, pext, or the proportion expressed across transcripts²², is equal to 0, indicating that
these regions have extremely low expression across all isoforms. All seven tested LDLR
transcript-aware variant sets excluded some variants in the pext = 0 regions and, subsequently, resulted in an
improved apolipoprotein B association compared to the gene-based method.
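
The idea behind a pext-style score can be illustrated with the simplified sketch below. It is not the published implementation: it assumes a single tissue, takes the score to be the summed expression of the transcripts that include a position divided by the summed expression of all transcripts of the gene, and uses hypothetical names and values.

```python
def pext(transcript_tpm, carrying_transcripts):
    """Share of a gene's expression contributed by transcripts that include a position.

    `transcript_tpm` maps transcript ID to expression (TPM) in one tissue.
    `carrying_transcripts` is the set of transcripts in which the position is exonic.
    """
    total = sum(transcript_tpm.values())
    if total == 0:
        return 0.0
    carried = sum(
        tpm for transcript, tpm in transcript_tpm.items()
        if transcript in carrying_transcripts
    )
    return carried / total


# Hypothetical gene with one dominant and one unexpressed isoform.
score = pext({"T_major": 7.8, "T_minor": 0.0}, {"T_minor"})
```

Here a position found only in the unexpressed isoform scores 0, which is the situation described for the pext = 0 regions of LDLR.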


Gene | Phenotype | P value | Effect size | N carriers | 95% CI | Transcripts | Median expression in all tissues (TPM)
LDLR | Apolipoprotein B | 5.8x10⁻²⁰ | 1.0 | 78 | 0.8,1.2 | ENST00000252444 | 7.8
LDLR | Apolipoprotein B | 6.7x10⁻¹⁸ | 0.9 | 84 | 0.7,1.1 | ENST00000558518 | 0.5
LDLR | Apolipoprotein B | 3.2x10⁻¹⁶ | 1.0 | 67 | 0.7,1.2 | ENST00000455727 | 0
LDLR | Apolipoprotein B | 1.0x10'* | 0.9 | 65 | 0.7,1.2 | ENST00000545707 | 0
LDLR | Apolipoprotein B | 4.4x10⁻¹⁵ | 0.9 | 56 | 0.7,1.2 | ENST00000535915 | 0
LDLR | Apolipoprotein B | 1.6x10⁻¹² | 0.6 | 118 | 0.5,0.8 | ENST00000558013 | 0
LDLR | Apolipoprotein B | 1.7x10° | 0.5 | 124 | 0.4,0.7 | ENST00000557933 | 0.1
LDLR | Apolipoprotein B | 3.8x10⁻⁴ | 0.2 | 287 | 0.1,0.3 | Inclusive model |


Table 6. Comparison of LDLR transcript-based models and the inclusive, gene-based model on 
apolipoprotein B levels 


4.7. Transcript-specific variant sets exclude misannotated pLOF variants 


Additionally, the transcript-specific variant sets can improve association testing through the
exclusion of misannotated variants. For example, polycystin-1 (PKD1) is well characterized
for its role in causing 85% of autosomal dominant polycystic kidney disease cases²³.
When damaged, the kidneys are unable to clear waste products like urea and creatinine, which
instead end up in high concentrations in the blood. Elevated serum urate is documented in
rare-variant burden testing of PKD1 pLOF variants²⁴. Our results show an improved association of PKD1
and urate using the transcript-based approach. When comparing the most significantly associated
transcript variant set, ENST00000423118, and the gene-based burden, 12 variants are excluded. The
most frequent among these is rs758337073, a PKD1 variant labeled as “likely benign” by ClinVar²⁵.
Rs758337073 is a “stop gained” pLOF in ENST00000488185 and is subsequently designated as a
pLOF in the gene-based method. However, rs758337073 is not considered a pLOF in 23/24 PKD1
transcripts. ENST00000488185 has low overall expression, and zero expression in kidney cortex or
medulla as shown by GTEx, indicating that this is likely a misannotated pLOF, and its inclusion in
the gene-based method adds noise and dampens the PKD1-urate burden association (Figure 6).


[Figure 6: exon diagrams (3’ to 5’) of PKD1 transcripts ENST00000423118 and ENST00000488185 and the inclusive model, with rs758337073 marked; color scale: median expression in all tissues (TPM), 0.0 to 7.5.]

Figure 6. Median expression of PKD1 transcript isoforms ENST00000423118 and
ENST00000488185, and the inclusive, gene-based model.


5. Discussion 


The drug discovery process is long, costly, and rarely ends in approval. Human genetic evidence 
provides an opportunity for novel target identification and validation for existing programs. Both 
common and rare variant genetic analyses have been shown to improve the chances of a successful
clinical trial and form the basis of rational drug discovery and development²⁶.

Our analysis highlights the importance of incorporating transcript-aware analyses into RVAS. 
We find that a transcript-aware approach broadly yields larger estimated reductions in circulating
cis-protein levels than the gene-based method. Since we expect pLOFs to lead to nonsense-mediated decay,
and a reduction of functional RNA and protein products, this indicates that the included variants are 
more likely to be functioning as true LOFs. This is also evident in quantitative-trait testing, where 
we observe increased absolute value of effect sizes for the isoform-specific variant sets. The 
transcript-level approach also identifies novel isoform-trait associations, and in rare cases, identifies 
associations with an opposite direction of effect as compared to the gene-based method, as is the 
case with GPNMB, HMOX2, and BST1 and their proteins encoded in cis. These data indicate the
potential for a transcript-aware approach to elucidate new genetically validated drug targets, some 
of which may be isoform-specific. 

Previously published literature has highlighted the importance of considering transcript data in 
RVAS. Cummings et al. described variants overlapping low confidence transcripts as a main 
contributor to the false annotation of pLOF variants. To counteract this, the authors developed the 
“proportion expressed across transcripts” (pext) score which quantifies the expression of transcript 
isoforms and exons. When testing pLOF variants in low pext-scored regions, the authors reported 
effect sizes comparable to the inclusion of synonymous variants. However, testing pLOF variants 
in high pext-scored regions resulted in substantially larger effect sizes²². This is consistent with our
results showing that the transcript-level burden leads to larger effect sizes, in some cases, like for 
LDLR, by excluding variants in low expression exonic regions. 

Our approach is limited in several ways. By only using transcript isoforms detected in at least 
one of 53 GTEx tissues, we exclude transcripts that may be expressed in other tissue and cell types. 
For example, several quantitative ocular phenotypes were tested, but we did not utilize data on 
ocular transcript isoform expression. Additionally, our analysis was only conducted on European- 
ancestry individuals due to limited sample size of other ancestral groups; RVAS in other populations 
may yield additional associations. 

One drawback to the transcript-aware approach is the reduction in sample size, as all isoform- 
aware variant sets are smaller than their gene-based counterparts. Additionally, a given gene can 
have multiple alternatively spliced, biologically relevant isoforms, where a pLOF variant in any 
number of those isoforms may lead to the same deleterious effect on a phenotype. In that case, 
testing a single transcript would not be a sufficient representation, and instead it would be better to 
use a more inclusive multi-transcript or gene-based approach. 

It is possible to test all transcript-variant sets alongside the gene-based method, as we have done 
here. However, this leads to an exceptionally stringent P value threshold and many highly related 
experiments. We suggest a curated implementation of the transcript-aware approach by testing only
specific transcripts chosen a priori, for example, only canonical transcripts, MANE-select
transcripts (which are intended to represent the most biologically relevant isoform for each
gene), or highly expressed transcript isoforms in relevant tissue types.

The drug development pipeline for a new compound can last 10-20 years and cost over $10
billion. Drug repurposing offers a more time- and cost-effective alternative. Computational 
approaches based on network graph representations, comprising a mixture of disease nodes 
and their interactions, have recently yielded new drug repurposing hypotheses, including 
suitable candidates for COVID-19. However, these interactomes remain aggregate by design 
and often lack disease specificity. This dilution of information may affect the relevance of 
drug node embeddings to a particular disease, the resulting drug-disease and drug-drug 
similarity scores, and therefore our ability to identify new targets or drug synergies. To 
address this problem, we propose constructing and learning disease-specific hypergraphs 
in which hyperedges encode biological pathways of various lengths. We use a modified 
node2vec algorithm to generate pathway embeddings. We evaluate our hypergraph’s ability 
to find repurposing targets for an incurable but prevalent disease, Alzheimer’s disease (AD), 
and compare our rank-ordered recommendations to those derived from a state-of-the-art
knowledge graph, the multiscale interactome. Using our method, we successfully identified 7 
promising repurposing candidates for AD that were ranked as unlikely repurposing targets 
by the multiscale interactome but for which the existing literature provides supporting 
evidence. Additionally, our drug repositioning suggestions are accompanied by explanations, 
eliciting plausible biological pathways. In the future, we plan on scaling our proposed method 
to 800+ diseases, combining single-disease hypergraphs into multi-disease hypergraphs to 
account for subpopulations with risk factors or encode a given patient’s comorbidities to 
formulate personalized repurposing recommendations. 


Supplementary materials and code: https://github.com/ayujain04/psb_supplement 
Keywords: Hypergraphs, Precision Medicine, Drug Repurposing, Disease Specificity 


1. Introduction 


The development of new drugs can take more than 15 years, from the discovery and pre-clinical
phase to review by regulatory agencies.1 Hence, repurposing drugs previously approved by the
Food and Drug Administration or European Medicines Agency serves as a convenient
alternative since they are already known to be safe in human populations. From a research
and development perspective, drug repurposing is a less risky enterprise. Indeed, following
compound identification, repositioned drugs would generally reach the market in less than 10
years. Beyond time savings, this strategy brings significant cost savings, potentially reducing
the average pharmaceutical pipeline’s budget by over $5 billion compared to traditional
drug development. To date, drug repurposing encompasses three main approaches: computational
biomedicine,2 biological experimentation, and their combination, e.g., through systems
pharmacology.

© 2023 The Authors. Open Access chapter published by World Scientific Publishing Company and
distributed under the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC)
4.0 License.

Computational approaches are both more time-effective and cost-effective than in vitro 
or in vivo biological experiments, which involve high-throughput screening or phenotypic 
screening based on animal and human models, respectively. Examples of available strategies 
include signature matching, genome-wide association studies, and the retrospective analysis 
of real-world clinical information.4 Their use has been unlocked by the concurrent emergence
of technical advances such as biological microarrays and the increase in data accessibility, as
illustrated by the rapid growth of electronic health records and biobanks.5

Simultaneously, massive genomic databases and cell lines have yielded 20+ high-quality
biological and biomedical knowledge graphs (KGs) such as SPOKE6 and PrimeKG7 and
aggregating platforms such as the KG-Hub8 to ensure that the former can be shared and made
interoperable for downstream graph machine learning tasks. Network-based methods for drug
repurposing rely on the encoding of interactions between entities (e.g., drugs, diseases, proteins,
biological functions) that can be heterogeneous (e.g., inhibition, binding). These representations
can help address both predictive (e.g., polypharmacy side effects) and inferential (e.g.,
reasoning over causal pathways) questions. Prior graph representations such as the multiscale
interactome (MSI)9 have proved useful in identifying agents that were previously repurposed
and in formulating new potential drug repurposing candidates.

However, drug repurposing hypotheses output by algorithms or deep learning models de- 
ployed on KGs may appear as “black boxes.” Yet structural and/or functional explanations 
are often desirable and necessary to understand the possible mechanisms of action underlying 
a predicted relationship between an existing drug and a disease — be it beneficial or detri- 
mental. Further, KGs integrating various data sources are rarely disease-specific. Thus, they 
may result in overall drug similarities that do not hold for the pathology of interest or in 
spurious correlations. This concern is especially relevant for neurodegenerative diseases such
as AD, given the presence of the blood-brain barrier10 and differential gene expression levels
and patterns in the brain, both relative to other tissues and across brain regions themselves.


Contributions. Hypergraphs have seen success in uncovering relationships in areas like
marketing,11 finance,12 and computer vision.13 Building upon this precedent in other disciplines,
we propose disease-specific hypergraphs as the basis for data-driven drug repurposing. Im- 
portantly, hypergraphs allow encoding relationships among groups of nodes (i.e., hyperedges) 
rather than pairwise relationships (i.e., edges) only. In our study, hyperedges capture known 
biological pathways. First, we show that the properties of hypergraphs reflect relative disease 
complexity. Second, we transform disease-specific hypergraphs into weighted graphs where 
nodes encode biological pathways and weighted edges relate to the number of entities that 
they have in common (e.g., the number of shared genes or proteins). With the intent of
encompassing disease specificity, we focus on pathways that start with a drug entity and end
with the disease entity of interest, irrespective of their length. Using a modified node2vec14
algorithm, we learn the disease-specific embeddings of each hyperedge. In particular, we use 
these low-dimensional representations to find original biological pathways that are highly 
similar to those whose starting entity is a drug currently prescribed to treat the disease or 
mitigate its progression. Then, we pool the top k or k% candidate biological pathways for 
various values of k and analyze the distributions of starting drug entities and middle gene 
entities. Such prevalences help gain a mechanistic understanding of promising drug classes 
and targets for repurposing. We illustrate our proposed method in the context of Alzheimer’s 
disease (AD), a multi-factorial disease of aging that still has no cure despite recent progress.15
We demonstrate that our proposed method outputs candidate biological pathways that are 
topologically non-obvious, i.e., they do not have any entities in common with the reference 
pathways involving currently prescribed drugs, besides the end disease entity. To assess the 
utility and complementarity of learning disease-specific pathway embeddings, we contrast these
non-obvious suggestions with those of the MSI, compare the corresponding rank orderings, 
and validate our findings by mining the biomedical literature to find supporting evidence. 
Our comparative analysis and publication search reveal that certain candidates that were
highly ranked (i.e., in the top 10%) by our hypergraph-based learning approach for AD drug 
repurposing and had supporting evidence in the literature were missed by the MSI (i.e., in the 
bottom 33% across all drugs). Going forward, our proposed framework can be scaled to derive 
novel drug repurposing hypotheses for each of the 800+ major diseases currently registered 
on the KG-Hub8 (i.e., excluding rare and orphan diseases).


2. Methods 


In this section, we describe our proposed approach, which encompasses four main parts:
hypergraph construction, pathway/hyperedge representation learning, comparative analysis
with the MSI,9 and mining of the biomedical literature to find supporting evidence.


2.1. Hypergraph Construction 


We built disease-specific hypergraphs by querying the Hetionet16 knowledge graph, which
comprises 1,522 drugs, 5,734 side effects, and 137 diseases, to extract significant biological
pathways (see footnote a) connecting each drug present in the KG to the disease of interest.
Hetionet is an existing state-of-the-art knowledge graph that incorporates 11 node types (e.g.,
gene, symptom), allowing for vast heterogeneity in the node composition of “metapaths” going
from a compound to a disease, which we sample from to create disease-specific hypergraphs
(Figure 1a). We query the Hetionet to retrieve all paths starting at one of the 1,522 drugs and
ending at a disease of choice. These paths are further grouped into metapath categories based
on the type and order of nodes present within the path. For example, a metapath category
could be “drug-gene1-gene2-targetDisease” or “drug-gene2-similarDisease-targetDisease.” We
use the degree-weighted path count and adjusted p-value defined by the Hetionet to quantify
the significance of a path, relative to others within its metapath category. We include the
top 10% most significant paths within each category to create our induced disease-specific
subgraph. We reasoned that selecting only the most significant pathways would help mitigate
the resulting number of false positives among drug repurposing candidates. The largest
connected component is treated as the subgraph of interest (Figure 1b); other components,
generally much smaller in size, are ignored. All existing biological pathways in the resulting
subgraph are explicitly unified as hyperedges, creating a disease-specific hypergraph (Figure
1c). Lastly, we transformed our disease-specific hypergraph into a disease-specific graph where
the nodes now correspond to the biological pathway hyperedges that originally constituted
the hypergraph. Two biological pathway nodes are connected if they share another element
in their path besides the start entity (drug) and the end entity (disease). The edge weight
is defined by the number of shared elements w, normalized between 0 and 1 using min-max
scaling (Figure 1d) to enable comparisons of graph structures across diseases.

a. A more detailed definition of significance follows.
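
The hypergraph-to-graph transformation described above can be illustrated with a minimal sketch. It assumes each hyperedge is stored as a tuple of entity names that starts with a drug and ends with the disease; the function name `pathway_graph` and the toy entity names are hypothetical and do not come from the authors' code or from Hetionet.

```python
from itertools import combinations


def pathway_graph(paths):
    """Turn a list of paths (hyperedges) into a weighted pathway graph.

    Each path is a sequence of entity names: drug first, disease last.
    Returns {(i, j): weight}, where i < j index into `paths` and weights
    are min-max scaled to [0, 1].
    """
    # Only interior entities count: the start drug and end disease are excluded.
    interiors = [set(path[1:-1]) for path in paths]

    raw = {}
    for i, j in combinations(range(len(paths)), 2):
        shared = len(interiors[i] & interiors[j])
        if shared > 0:
            raw[(i, j)] = shared  # w = number of shared interior elements

    if not raw:
        return {}

    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:
        # Assumption: when all raw weights are equal, map them all to 1.0.
        return {edge: 1.0 for edge in raw}
    return {edge: (w - lo) / (hi - lo) for edge, w in raw.items()}


# Toy example with made-up entities (not real Hetionet paths).
toy_paths = [
    ("drug1", "geneA", "geneB", "disease"),
    ("drug2", "geneA", "geneB", "disease"),
    ("drug3", "geneB", "disease"),
    ("drug4", "geneC", "disease"),
]
toy_graph = pathway_graph(toy_paths)
```

In this toy case, the two pathways sharing both genes receive weight 1.0, while pairs sharing a single gene are scaled to 0.0. Under plain min-max scaling, the weakest links therefore carry zero weight in any weight-proportional sampling step; whether the authors handle this case differently is not stated in the text.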


2.2. Biological Pathway Hyperedge Representation Learning 


Given a specific disease of interest, our study aimed to identify biological pathways analogous,
in a learned embedding space, to those associated with drugs currently used
to treat it. In particular, we conducted a case study on Alzheimer’s disease and considered
medications prescribed to alleviate the associated symptoms and behavioral complications. We
focused primarily on three compounds: donepezil, galantamine, and memantine,17,18 approved
by the FDA in 1996, 2001, and 2003, respectively. Our approach is disease-agnostic and can be
readily extended to diseases other than AD, given a list of compounds currently
used in clinical practice or previously suggested as repurposing candidates and access
to adequate computing resources.

Our methodology involved initiating a random walk of fixed length L on the transformed,
weighted graph $G_w$ delineated in Figure 1(d), commencing from any of the biological pathway
nodes whose first path element was one of the drugs currently prescribed against the disease
of interest. We accounted for the presence of weighted edges by sampling neighboring nodes
proportionally to the strength of the connection. The random walker began at a selected node,
then proceeded iteratively to an adjacent node chosen uniformly at random among possibly
duplicated neighbors (i.e., with probability proportional to edge weight), and repeated this
process for a predetermined number of steps. For each
eligible starting node in our weighted graph $G_w$, a random walk was initiated, with a fixed
length set at L=80. Each start node-specific random walk was replicated R=10 times, in light
of the vast heterogeneity of node types and the resulting variation in feasible trajectories.

We denote by $v_i$ the position of the random walker at iteration $i \geq 1$. At each iteration $i$
of the random walk, the probability of transitioning from biological pathway $x$ to biological
pathway $y$ is expressed as:

$$P(v_i = y \mid v_{i-1} = x) = \frac{\text{weight of edge between } x \text{ and } y}{\text{sum of weights of all edges leaving } x} \tag{1}$$
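
The weighted transition rule and the repeated fixed-length walks described above can be sketched as follows. This is an illustrative sketch rather than the authors' modified node2vec implementation: the adjacency-dictionary layout, the function names, and the convention that walk length counts nodes (not steps) are all assumptions.

```python
import random


def transition_probs(graph, x):
    """Return P(next = y | current = x) for every neighbor y of x.

    `graph` maps each node to a dict {neighbor: edge_weight}.
    """
    total = sum(graph[x].values())
    return {y: w / total for y, w in graph[x].items()}


def random_walk(graph, start, length, rng):
    """Walk `length` nodes, picking neighbors proportionally to edge weight."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph[walk[-1]]
        if not neighbors:
            break  # dead end: stop early
        nodes = list(neighbors)
        weights = [neighbors[n] for n in nodes]
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk


def generate_walks(graph, start_nodes, length=80, repeats=10, seed=0):
    """Run `repeats` walks of `length` nodes from each start node."""
    rng = random.Random(seed)
    return [
        random_walk(graph, start, length, rng)
        for start in start_nodes
        for _ in range(repeats)
    ]


# Toy pathway graph with hypothetical node names and weights.
toy = {
    "p1": {"p2": 1.0, "p3": 0.5},
    "p2": {"p1": 1.0},
    "p3": {"p1": 0.5},
}
walks = generate_walks(toy, ["p1"], length=80, repeats=10)
```

Each resulting walk is a list of pathway identifiers that can be treated as a "sentence" for a Skip-Gram model. Note that `random.choices` raises an error if all outgoing weights of a node are zero, so zero-weight edges would need separate handling.
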

Fig. 1: Pipeline to derive disease-specific hypergraphs from existing KGs and learn contextual
embeddings of biological pathways. (a) Full Hetionet graph with nodes of 11 types, including
1,522 drugs, 5,734 side effects, and 137 diseases. (b) Disease-specific subgraph, selecting
only the biological pathways whose end node is the disease of interest. Of note, only the top
10% most significant pathways within each metapath category (as defined by their length and
structure, e.g., drug-gene1-gene2-disease) are retained, based on a path importance score
assigned by Hetionet.16 (c) Disease-specific hypergraph unifying significant paths or hyperedges
into a single structure. (d) Disease-specific weighted graph resulting from the transformation
of the hypergraph described in (c). Each hyperedge in (c) becomes a node in (d) and nodes
in (d) are connected if their corresponding biological pathways in (c) have at least one other
element in common, beyond the disease node. Each edge is assigned a weight w, corresponding
to the number of elements common to the two biological pathways. Note that the weight does
not include the disease node, at which all pathways present in a given disease-specific hypergraph
intersect by design; similarly, the compound node at the start of each reference biological
pathway does not contribute to the weight either, in keeping with our focus on learning biological
similarities between the drugs.


2.2.1. Skip-Gram Model 


We interpreted the resulting random walks as sentences, utilizing the Word2Vec Skip-Gram 
model provided by gensim to develop node embeddings for each biological pathway.19 This
model predicts context words (nodes within the same walk) given a target word (a node).
Applied to the context of our disease-specific weighted graph, the embeddings of biological
pathways learned through this process encapsulate the local neighborhood structure of the 
nodes and are subsequently used for our pathway similarity search. 

The Skip-Gram model’s objective is to devise word representations that effectively predict
surrounding words in a sentence or document.20 Formally stated, given a sequence of training
words $w_1, w_2, \ldots, w_T$, the model aims to maximize the average log probability obtained via the
chain rule:

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-k \leq j \leq k,\ j \neq 0} \log P(w_{t+j} \mid w_t) \tag{2}$$

where k denotes the size of the training context and T denotes the total number of training 
words. In linguistics, k often represents the typical length of a sentence; by analogy, in biology, 
it could encode the number of reactions occurring in cascade. Similarly, in linguistics, T 
often represents the size of the vocabulary, which can be language-specific; in biology, the 
total number of biological pathways involved is disease-specific. To account for the fact that
pathway length can greatly vary, we guided the model to learn embeddings of fixed dimension
p = 64. We subsequently used cosine similarity as the metric to quantify similarity
between any two biological pathways. 
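
The similarity search over learned embeddings can be sketched in a few lines. The example below does not train a Skip-Gram model (the text reports using gensim for that step); it only shows how cosine similarity could be used to retrieve the top share of pathways closest to a reference pathway. The function names, the integer `percent` parameter, and the 3-dimensional toy vectors (standing in for 64-dimensional embeddings) are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for zero vectors
    return dot / (norm_u * norm_v)


def top_percent(embeddings, reference, percent=10):
    """Return the top `percent`% of pathways most similar to `reference`.

    `embeddings` maps pathway id -> vector. The result is a list of
    (pathway id, similarity) pairs sorted by decreasing similarity.
    """
    ref_vec = embeddings[reference]
    scored = [
        (name, cosine_similarity(vec, ref_vec))
        for name, vec in embeddings.items()
        if name != reference
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # Integer ceiling of percent% of the candidates, keeping at least one.
    keep = max(1, (percent * len(scored) + 99) // 100)
    return scored[:keep]


# Toy embeddings with hypothetical pathway ids.
toy_embeddings = {
    "ref_path": [1.0, 0.0, 0.0],
    "path_1": [2.0, 0.0, 0.0],
    "path_2": [0.0, 1.0, 0.0],
    "path_3": [1.0, 1.0, 0.0],
}
closest = top_percent(toy_embeddings, "ref_path", percent=50)
```

Because cosine similarity ignores vector length, `path_1` is treated as identical in direction to the reference even though its norm differs.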

Our decision to utilize the Skip-Gram algorithm for learning embeddings was driven by our 
intent to infer semantic contextual relationships among biological pathways, given a specific 
disease. To learn the embeddings, we chose the random walk and skip-gram based approach 
to derive a first proof of concept of using a hypergraph structure and explicitly restricting it 
to a given disease. 


2.3. Methods for Evaluation 


Our approach to proposing repurposing hypotheses for a given disease of interest relies on 
identifying the top 10% of biological pathway embeddings having the highest cosine simi- 
larity with pathways initiating from any of the drugs known to mitigate or prevent disease 
progression and ending at the disease node. While pathways can include a variety of interme- 
diary nodes (e.g., symptom, anatomical object, etc.), we selected exclusively those pathways 
with one or more gene intermediary nodes linking one of the 1,522 drug candidates to the 
considered disease. We reasoned that this feature would help focus our hypotheses on biolog- 
ically plausible pathways and thus facilitate the interpretation of drug repurposing candidate 
rankings. 

The data and methods of the multiscale interactome (MSI) were used as a baseline for 
comparative analysis. The MSI consists of a large biological KG with 1,566 drugs and 841
diseases and leverages a random walk approach to formulate repurposing hypotheses. 

From our weighted graphs, we quantified the rank of each biological pathway in terms 
of its cosine similarity to a selected relevant pathway. We considered relevant pathways to 
be those whose starting drug entity was a drug currently indicated against the disease of 
interest. To obtain a single metric per drug, we either aggregated the cosine similarity scores 
of the pathways in which it was involved into a median value or used the pathway with
the maximum similarity. Then, we used this summary metric to rank all considered drugs
and contrast our own repurposing suggestions with those of the MSI. We used these metrics 
and the relative rankings of psychoanaleptics in the MSI vs. our hypergraph to compare their 
AUC. 
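
As a rough illustration of the per-drug summary step, the sketch below collapses pathway-level similarity scores into one score per drug, using either the median or the maximum, and converts the result into ranks. The input layout (a list of `(drug, similarity)` pairs), the function name `rank_drugs`, and the alphabetical tie-breaking rule are assumptions rather than details given in the text.

```python
from statistics import median


def rank_drugs(path_scores, how="median"):
    """Rank drugs by an aggregate of their pathway similarity scores.

    `path_scores` is an iterable of (drug, similarity) pairs, one per pathway.
    `how` is "median" or "max". Returns {drug: rank}, where rank 1 is best.
    """
    if how not in ("median", "max"):
        raise ValueError("how must be 'median' or 'max'")

    by_drug = {}
    for drug, score in path_scores:
        by_drug.setdefault(drug, []).append(score)

    aggregate = median if how == "median" else max
    summary = {drug: aggregate(scores) for drug, scores in by_drug.items()}

    # Sort by decreasing score; ties are broken alphabetically (an assumption).
    ordered = sorted(summary, key=lambda d: (-summary[d], d))
    return {drug: rank for rank, drug in enumerate(ordered, start=1)}


# Toy scores with hypothetical drug names.
toy_scores = [
    ("drugA", 0.9), ("drugA", 0.1), ("drugA", 0.2),
    ("drugB", 0.5), ("drugB", 0.6),
    ("drugC", 0.3),
]
median_ranks = rank_drugs(toy_scores, how="median")
max_ranks = rank_drugs(toy_scores, how="max")
```

In the toy data, `drugA` ranks last under the median but first under the maximum, which shows how a single highly similar pathway can dominate the max-based ranking.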

From the MSI, we ranked drugs by their similarity to disease-specific drugs, based on the
cosine similarity of their 64-dimensional embeddings. We also established a rank-ordered
list of drugs most similar to the
disease of interest (e.g., AD), given the cosine similarity between drug and disease embeddings. 
Notably, while the rank of a drug as derived from the MSI is the output of either a single 
similarity score (with the disease’s embedding) or a couple of scores only (with the embeddings 
of known drugs against this disease), its rank as derived from our hypergraph-based approach 
is the output of a much larger double-averaging operation, across all pathways starting at the 
drug of interest and those starting at a drug already known to target this disease. 

We aimed to identify potential blind spots in the MSI that our methodology might
successfully uncover. To this end, for drugs that our approach ranked among the top 10%, we
retrieved the corresponding MSI-derived rankings for comparison. For each pair, we computed 
the absolute difference in rank. In addition, we computed an aggregate similarity metric be- 
tween the MSI and our approach, defined as the size of the overlap between the sets of drugs 
appearing in the top 10% under each. 
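
The two comparison metrics described above (absolute rank difference for our top-ranked drugs, and overlap between the two top-10% sets) can be sketched as follows. The dictionaries mapping drug names to ranks, the function names, and the use of an integer `percent` with a ceiling cutoff are illustrative assumptions.

```python
def top_set(ranks, percent=10):
    """Drugs whose rank falls within the top `percent`% of the list."""
    cutoff = (percent * len(ranks) + 99) // 100  # integer ceiling
    return {drug for drug, rank in ranks.items() if rank <= cutoff}


def compare_rankings(ours, msi, percent=10):
    """Compare two rankings given as {drug: rank} (rank 1 is best).

    Returns (diffs, overlap):
      diffs   - absolute rank difference for each drug in our top set
                that also appears in the MSI ranking
      overlap - drugs present in the top set of both rankings
    """
    ours_top = top_set(ours, percent)
    diffs = {d: abs(ours[d] - msi[d]) for d in ours_top if d in msi}
    overlap = ours_top & top_set(msi, percent)
    return diffs, overlap


# Toy rankings over ten hypothetical drugs; the MSI order is fully reversed.
ours_toy = {f"d{i}": i for i in range(1, 11)}
msi_toy = {f"d{i}": 11 - i for i in range(1, 11)}
rank_diffs, shared_top = compare_rankings(ours_toy, msi_toy, percent=20)
```

Large rank differences combined with a small overlap flag the drugs on which the two approaches disagree most, which are the candidates the text then examines against the literature.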

To further validate our methodology and the relevance of the resulting pathway embed- 
dings, we undertook a deeper analysis of the drug repurposing suggestions that most differed 
between the MSI and our proposed method, based on the difference in ranks. In particular, 
we searched the biomedical literature for biological and/or clinical evidence about drug re- 
purposing suggestions that fell within the bottom third of the MSI’s rank-ordered list (i.e., 
compounds ranked 1,032 to 1,522) while being in the top 10% of ours. 


3. Results 


We conducted several experiments on our disease-specific hypergraph, using a sample of 18 
prevalent and/or incurable diseases. First, we computed summary statistics about these 18 
disease hypergraphs and formed clusters that reflect known disease complexity (Section 3.1).
Second, we learned disease-specific embeddings of biological pathways on these hypergraphs to 
identify potential drug targets. To interpret our findings, we explored the distribution of genes 
involved in the resulting pathways (Section 3.2 and Figure 3). Our intent was to confirm some 
of the repurposing hypotheses that emerged from the MSI and to formulate new ones. Third, 
we reviewed the literature to gather information about targets overlooked by the MSI but 
documented in prior studies; we summarize our findings (Section 3.3). We also evaluated the 
AUC differences between the MSI and our Alzheimer’s disease-specific hypergraph (Appendix
B in our supplement).


3.1. Hypergraph Construction Underlines Known Disease Complexity 


Utilizing our method, we constructed 18 disease-specific hypergraphs (Figure 2). For each, 
we computed the number of hyperedges (biological pathways) and the number of weighted 
links among them; we mapped diseases on a scatterplot along these two dimensions. Using
the k-means clustering algorithm (k=4), we grouped the 18 hypergraphs into four clusters. In 
addition to this visual representation, we quantified the number of protein nodes that each 
disease connects with in the MSI — a proxy to characterize disease complexity. Generally, more 
complex hypergraphs, with a larger number of hyperedges and links, were those of diseases 
involving a larger number of proteins. This suggests that the network properties of disease- 
specific hypergraphs could be leveraged to summarize their complexity and identify diseases 
that may be proximal based on their higher-order structure. The richness of the embeddings
that we seek to learn will depend on the size of the underlying hypergraph; in particular, 
smaller hypergraphs may yield sparser embeddings. For instance, Figure 2 highlights a clear 
separation between chronic diseases such as Chronic Kidney Disease (CKD) and Coronary 
Heart Disease (CHD) and more complex diseases such as Rheumatoid Arthritis (RA) and 
Amyotrophic Lateral Sclerosis (ALS) involving auto-immune processes. Among the 18
hypergraphs, those of diseases in Cluster A offer more information to learn from and potentially
uncover peripheral biological pathways of importance; this configuration prompted us to select
one of the diseases in Cluster A for our case study. Alzheimer’s Disease (AD) was chosen due
to its large and growing prevalence, as about 6.2 million Americans aged 65 and older are
currently affected, a number which could rise to 13.8 million by 2060.21
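
The clustering step can be illustrated with a small, dependency-free k-means sketch on two features per disease (number of hyperedges, number of weighted links). This is not the authors' code: the deterministic farthest-point initialization, the function names, and the toy coordinates are assumptions, and the text does not say whether the two axes were rescaled before clustering.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_init(points, k):
    """Deterministic seeding: start at the first point, then repeatedly add
    the point farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    return centers


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm. Returns (labels, centers)."""
    centers = farthest_point_init(points, k)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        # Update step: each center moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                updated.append(centers[c])  # keep an empty cluster's center
        if updated == centers:
            break
        centers = updated
    return labels, centers


# Toy (hyperedges, weighted links) pairs; values are made up for illustration.
toy_points = [
    (1000, 10000), (1200, 12000),
    (4000, 200000), (4200, 210000),
    (8000, 500000), (8300, 520000),
    (11000, 900000), (11500, 950000),
]
toy_labels, toy_centers = kmeans(toy_points, k=4)
```

Because the link counts are orders of magnitude larger than the hyperedge counts, unscaled Euclidean distance is dominated by the second axis; standardizing both features first is a common choice when both dimensions should contribute equally.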


[Figure 2 scatter plot: x-axis, Number of Hyperedges (0 to 12,000); y-axis, number of weighted
links (scale 1e6); each point is one of the 18 diseases, colored by cluster (A, B, C, D).]

Cluster  Disease (index)              Proteins connected in the MSI
A        Alzheimer's Disease (1)      73
A        Rheumatoid Arthritis (2)     162
A        ALS (3)                      49
B        Lupus (4)                    60
B        Psoriasis (5)                52
B        Asthma (6)                   92
B        Hepatitis B (7)              6
C        Hypertension (8)             151
C        Type 2 Diabetes (9)          78
C        Type 1 Diabetes (10)         9
D        CHD (11)                     15
D        CKD (12)                     39
D        Diabetic Retinopathy (13)    13
D        Renal Failure (14)           13
D        Parkinson's Disease (15)     39
D        AIDS (16)                    1
D        Atrial Fibrillation (17)     34
D        Vascular Dementia (18)       N/A


Fig. 2: The scatter plot represents four disease clusters, based on two structural attributes of 
their respective disease-specific hypergraphs, constructed as outlined in Figure 1(b)-(c). The
location of a given cluster indicates the complexity of the higher-order hypergraph structures 
and often reflects disease complexity. Diseases known to be highly complex (e.g., AD) are 
positioned in the top right corner; conversely, diseases deemed to be of lower complexity (e.g., 
renal failure) are situated in the bottom left corner. While disease complexity is primarily 
defined by the number of currently known biological pathways involved, we also provide the 
number of proteins to which each disease is directly connected (i.e., one-hop neighbors) in the 
MSI.9
3.2. Hypergraph Representation Learning Identifies Repurposing Targets
in Accordance with the MSI 


The MSI and our AD-specific hypergraph shared a 50% overlap in drug categories for their
repurposing suggestions: psycholeptics, psychoanaleptics, and drugs used in diabetes management,
all of which are supported in the literature as sources of repurposing candidates for AD.22,23 The
MSI also brought attention to drugs acting on the renin-angiotensin system, sex hormones,
and other nervous system drugs, thereby grouping together common co-morbidity targets for
AD treatment.24-27 In contrast, our AD-specific hypergraph focused more on antineoplastic
agents, cardiac therapy, and ophthalmologicals, each of which has been associated with AD
in the literature as well.28-30

Figure 3 shows how our hypergraphs can be used to add a layer of explainability to 
existing knowledge graphs. The figure also compares the number of gene targets that each 
of the suggestions from both our method and the MSI contained. After finding the top 10% 
of drugs most similar to AD in the MSI, we found all gene intermediary nodes in the paths 
starting at these drugs and ending at AD in our hypergraph. We then compared the makeup 
of these pathways to the makeup of the top 10% most similar pathways to those of donepezil, 
memantine, or galantamine (ranked by highest cosine similarity to any of the three drugs). 

It is important to note the discrepancy in the number of paths considered when generating 
Figure 3(a)-(g) and Figure 3(h)-(n). The former, involving 926 paths, includes all paths in the 
AD hypergraph that start with a drug in the top 10% cosine similarity to AD in the MSI. 
Conversely, the latter, with 574 paths, only encompasses the top 10% of pathways that exhibit 
the highest cosine similarity to paths initiating from donepezil, memantine, or galantamine. 

These findings underscore the efficacy of our disease-specific hypergraph approach in tar- 
geting drugs with pathways highly similar to those of known pertinent drugs when identifying 
potential candidates for repurposing. Moreover, these outcomes provide initial validation to 
our hypergraph representation learning method, which will be further discussed in the follow- 
ing subsection. 


3.3. Hypergraph Representation Learning Identifies Drug Repurposing 
Targets Discounted from the MSI but Present in Literature 


Hypergraph representation learning suggested 7 drug repurposing targets out of its top 30 
(23%) that the MSI discounted (rank of > 1032 in either column (2) or (3) of Table 1 in 
the supplement). The 7 drugs were eplerenone (diuretic), fosphenytoin (cardiac therapy),
exemestane (endocrine therapy), eperisone (muscle relaxants), protriptyline (psychoanaleptics),
ethotoin (antiepileptics), and pentamidine (antiprotozoals). Four of the 7 (eplerenone,
pentamidine, exemestane, and protriptyline) have literature supporting their potential efficacy
against AD. For the remaining 3 (eperisone, ethotoin, and fosphenytoin), we explored
tangential literature to evaluate the suggestion and/or examined the path that this drug
headed in hopes of understanding why the prediction was made. Refer to Table 1 in the
supplement for the exhaustive list of the top 30 repurposing suggestions based on pathway 
similarity to donepezil, memantine, or galantamine in the AD hypergraph. 
Fig. 3: (a)-(g) illustrate the number of gene targets within the paths of our AD hypergraph,
originating from a drug node that ranks within the top 10% in terms of highest cosine similarity 
to the AD node in the MSI. (h)-(n) depict the number of gene targets in paths within our AD 
hypergraph that are within the top 10% in similarity to the pathways of donepezil, memantine, 
and galantamine, only considering paths with gene intermediary nodes. (a) presents the six
categories with the most gene targets among the MSI’s top 10% suggestions. Conversely, 
(h) displays the six categories with the most gene targets in the AD hypergraph’s top 10% 
predictions based on similarity to donepezil, memantine, and galantamine. (b)-(g) further 
break down these top six categories from (a), demonstrating the count of each gene in the top 
10% of predicted paths. Similarly, (i)-(n) break down the top six categories from (h), showing 
the count of each gene within the top 10% of paths similar to those of donepezil, memantine, 
and galantamine, as determined by cosine similarity. In these graphics, ‘n’ represents the 
number of unique drugs within this category, while ‘m’ signifies the number of unique gene 
targets in (b)-(g) and (i)-(n). 


3.3.1. Literature Review on 7 Targets Found by Hypergraph Representation Learning but
Missed by MSI 


In this section, we delve into the literature that supports our hypotheses for drug repurposing. 
These drugs were identified as potential repurposing candidates, yet were overlooked by the 
MSI. 

Eplerenone has been observed to decrease brain damage, defined by cell death and cortical
thinning, in a rat model.31 Additionally, it is documented that eplerenone enhances cognitive
function in a mouse model of AD.32 Another study reinforces these findings, illustrating that
eplerenone can mitigate cognitive deficits in the hippocampus of spontaneously hypertensive
rats. These outcomes coincide with the established correlation between hypertension and
dementia/AD.34 Lastly, an in silico pharmacological assessment of eplerenone proposed that
the drug holds potential in treating AD.35

Exemestane appears effective when AD and cancer co-occur: women diagnosed with breast 
cancer who were treated with tamoxifen or exemestane exhibited fewer instances of AD.36 
A subsequent study characterizes the relationship between AD and cancer, demonstrating 
that exemestane is proficient at managing cancer when it co-occurs with AD.37 Additional 
research has suggested exemestane as a potential therapeutic for Parkinson’s Disease (PD),38 
a neurodegenerative disorder associated with AD.39 

Protriptyline was found to have the highest inhibitory activity among 140 FDA-approved 
nervous system drugs against the three primary AD targets: AChE, BACE-1, and Aβ 
aggregation.40,41 A study using an AD rat model concluded that protriptyline reduces oxidative 
damage and improves spatial memory in the treated animals.42 

Pentamidine was found to inhibit Aβ-induced gliosis and neuroinflammation in a mouse 
model of AD.43 To our knowledge, this is the only publication endorsing 
its use in alleviating AD, likely because pentamidine is unable to cross the blood-brain barrier. 
However, recent developments in nose-to-brain methods could surmount this obstacle.!° 

Ethotoin (antiepileptic), to our knowledge, doesn’t have explicit literature connecting it 
to AD. There is, however, a study that warns that antiepileptics could escalate stroke risk 
in AD patients.44 This highlights a current limitation of our approach: we currently do not 
differentiate between positive and negative drug pathways to a disease. 

Fosphenytoin’s effect on AD is not specifically discussed in the literature. However, it is a 
prodrug of phenytoin,45 which inhibits hippocampal tissue degradation and consequently the 
progression of AD.46 

Eperisone lacks direct literature linking it to AD, to the best of our knowledge. When 
examining the pathway in our hypergraph that starts at eperisone, ends at AD, and was 
suggested to have a high similarity to galantamine (see Table 1 in the supplement), we find: 
Eperisone-Triprolidine-CYP2D6-AD. This pathway shows that eperisone was connected to 
AD by way of similarity to triprolidine. Triprolidine has been observed to enhance NREM 
sleep in AD patients.47 


4. Conclusion and Future Directions 


Our disease-specific hypergraphs have proven useful for clustering diseases based on their 
known complexity, identifying potential drug repurposing targets alongside existing methods, 
and discovering promising repurposing targets overlooked by state-of-the-art methods. We 
found that the disease hypergraphs formed four clear groups when comparing the number of 
hyperedges to the number of links between these hyperedges (see Figure 2). Additionally, in 
Figure 2, we see more complex hypergraphs correlated with more known disease complexity, 
which we assessed by counting the protein-disease connections in the MSI. 

We also demonstrated the value of this method in generating drug repurposing suggestions 
for Alzheimer’s disease (AD). We saw a significant overlap with the suggestions from the 
MSI when looking at the top 10% of suggestions from both methods, especially among drug 
categories with the highest number of gene targets in the pathways. 

Among our top 30 repurposing suggestions for AD, ranked by pathway cosine similarity, 
we focused on pathways with one or more protein/gene intermediary nodes, hoping to keep 
our results grounded in biological relevance. Each suggestion comes with the drug pathway 
that supports its potential use in treating AD (see Table 1 in the supplement for the full list). 
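
As a rough illustration of ranking pathways by cosine similarity, the sketch below scores each candidate pathway embedding against a set of reference pathway embeddings and keeps the top-scoring ones. The function names, the use of the maximum similarity over references as the score, and the toy vectors are assumptions made for illustration; they are not taken from our pipeline.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_pathways(candidates, references, top_k=30):
    """Rank candidate pathways by similarity to reference pathways.

    candidates: dict mapping a pathway label to its embedding vector.
    references: list of embedding vectors for pathways of known drugs.
    Assumption: a candidate's score is its highest cosine similarity
    to any reference embedding.
    """
    scored = []
    for name, vector in candidates.items():
        score = max(cosine_similarity(vector, ref) for ref in references)
        scored.append((name, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

With toy two-dimensional embeddings, a candidate pointing in the same direction as a reference ranks first and an orthogonal candidate ranks last.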

Our method also identified promising repurposing pathways for AD that the 
MSI overlooked. In fact, 7 out of our top 30 suggestions ranked in the lower third of the MSI’s 
suggestions. We found supporting evidence for these suggestions in the scientific literature, 
both from studies that directly tested the drugs and from related research. 

Looking ahead, we plan to enhance this method in several ways. We aim to refine our 
hypergraph construction by merging disease hypergraphs of co-occurring diseases such as AD, 
Type 2 Diabetes, and Hypertension. We will explore ways to improve our pathway embeddings 
and, crucially, we will look beyond the literature review to other forms of validation, including 
evidence from electronic health records and experimental studies. 

Future research will explore the use of power iteration, PageRank, and PageRank with 
teleportation for learning additional sets of embeddings for each biological pathway-disease 
pair. We aim to compare the resulting outputs pairwise to our current pathway embeddings 
and to assess the sensitivity of the downstream similarity scores. To derive more robust drug 
repurposing candidates for a specific disease, several embeddings of biological pathways could 
be combined to minimize dependence on a particular algorithm or parameter set and instead 
maximize confidence across representation learning approaches. Further, we can experiment 
with more specific definitions of a true positive, perhaps using literature for or against a drug in 
the context of a disease of interest. 
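
For readers unfamiliar with PageRank with teleportation, the following is a minimal sketch of the standard power-iteration computation on an ordinary directed graph. It is not an implementation for hypergraphs and not part of the present work; the adjacency-dictionary input, the teleport probability of 0.15, and the uniform redistribution of mass from nodes without out-links are all illustrative assumptions.

```python
def pagerank(adjacency, teleport=0.15, tol=1e-10, max_iter=1000):
    """PageRank with teleportation, computed by power iteration.

    adjacency: dict mapping each node to a list of its out-neighbors.
    Every neighbor must also appear as a key of the dict.
    Assumption: nodes without out-links spread their rank uniformly.
    """
    nodes = sorted(adjacency)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes with no out-links is shared among all nodes.
        dangling = sum(rank[node] for node in nodes if not adjacency[node])
        base = teleport / n + (1.0 - teleport) * dangling / n
        new_rank = {node: base for node in nodes}
        # Each remaining node passes its damped rank equally to its out-neighbors.
        for node in nodes:
            out_links = adjacency[node]
            if out_links:
                share = (1.0 - teleport) * rank[node] / len(out_links)
                for neighbor in out_links:
                    new_rank[neighbor] += share
        change = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank
```

The returned scores sum to one, so they can be compared across nodes as a probability distribution.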

Additionally, we plan to run further sensitivity analyses on the comparison metric and 
learning method, varying parameters such as the dimension of the embedding vector (p), the length 
of each random walk (L), and the number of random walks taken per node (R). We also plan 
to compare our results to hypergraphs encompassing more paths. Now that we have 
outlined a proof-of-concept for this design on the top 10% of paths ending at a disease of 
interest, we can explore how the suggestions compare to hypergraphs built with the top 20%, 
25%, etc. 
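
To make the roles of L and R concrete, the sketch below generates uniform random walks on a plain adjacency list. It is a simplified illustration and not our hypergraph walk procedure: here L is assumed to count steps (so a full walk visits L + 1 nodes), R is the number of walks started from each node, and the function name and fixed seed are hypothetical. The embedding dimension p would enter only in a later embedding step that is not shown.

```python
import random


def generate_walks(adjacency, walk_length, walks_per_node, seed=0):
    """Generate uniform random walks over a graph.

    adjacency: dict mapping each node to a list of neighbors.
    walk_length: number of steps per walk (L); a walk stops early at a dead end.
    walks_per_node: number of walks started from every node (R).
    """
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        for start in sorted(adjacency):
            walk = [start]
            while len(walk) < walk_length + 1:
                neighbors = adjacency[walk[-1]]
                if not neighbors:
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks
```

A sensitivity analysis could then vary `walk_length` and `walks_per_node` and compare the resulting similarity rankings.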


Protein kinases are a primary focus in targeted therapy development for cancer, owing to their role 
as regulators in nearly all areas of cell life. Recent strategies targeting the kinome with combination 
therapies have shown promise, such as trametinib and dabrafenib in advanced melanoma, but 
empirical design for less characterized pathways remains a challenge. Computational combination 
screening is an attractive alternative, allowing in silico filtering prior to experimental testing of 
drastically fewer leads, increasing efficiency and effectiveness of drug development pipelines. In 
this work, we generated combined kinome inhibition states of 40,000 kinase inhibitor combinations 
from kinobeads-based kinome profiling across 64 doses. We then integrated these with 
transcriptomics from CCLE to build machine learning models with elastic-net feature selection to 
predict cell line sensitivity across nine cancer types, with accuracy R² ~ 0.75-0.9. We then 
validated the model by using a PDX-derived TNBC cell line and saw good global accuracy (R² ~ 
0.7) as well as high accuracy in predicting synergy using four popular metrics (R² ~ 0.9). 
Additionally, the model was able to predict a highly synergistic combination of trametinib and 
omipalisib for TNBC treatment, which incidentally was recently in phase I clinical trials. Our 
choice of tree-based models for greater interpretability allowed interrogation of highly predictive 
kinases in each cancer type, such as the MAPK, CDK, and STK kinases. Overall, these results 
suggest that kinome inhibition states of kinase inhibitor combinations are strongly predictive of cell 
line responses and have great potential for integration into computational drug screening pipelines. 
This approach may facilitate the identification of effective kinase inhibitor combinations and 
accelerate the development of novel cancer therapies, ultimately improving patient outcomes. 


Keywords: Kinase signaling, precision medicine, systems biology, drug response prediction. 


1. Introduction 


Protein kinases, which serve as the primary conduits for information transfer within cells, are often 
implicated as key drivers in cancer development and have become a cornerstone in current targeted 
therapies [1]. The rapid expansion of kinase inhibitor therapies as an oncology drug class is 
exemplified by the FDA's approval of nearly 60 such therapies over the past 20 years [2]. Despite 
their initial promise, kinase-targeting monotherapies frequently give rise to resistance [3], in part 
due to the dynamic nature of the kinase network, i.e., the “kinome,” which has been shown to 
reprogram and respond to the inhibition of single kinases by upregulating expression of partner 
pathways [4-6]. This necessitates the development of novel strategies to effectively target the 
kinome and harness the vast array of potential drug targets it offers. 


© 2023 The Authors. Open Access chapter published by World Scientific Publishing Company and distributed under the terms of the 
Creative Commons Attribution Non-Commercial (CC BY-NC) 4.0 License. 

One emerging strategy to counteract resistance involves the design of combination therapies, 
which perturb multiple targets with two or more drugs. These targets may be either known 
compensatory pathway partners, referred to as "horizontal pathway inhibition," or multiple targets 
within the same pathway, known as "vertical pathway inhibition" [7]. This approach has recently 
gained traction with the FDA approval of the combination of trametinib and dabrafenib for treating 
advanced melanoma [8]. This combination therapy "vertically" targets both BRAF and MEK within 
the RAF-MEK-ERK (MAPK) pathway, demonstrating the potential effectiveness of this strategy. 
However, this method of empirical design of combination therapies is not feasible for less 
characterized kinase pathways, and the sheer number of possible combinations of potential kinase 
targets (2^500) prevents brute-force screening or drug design. 

To circumvent this issue, computational screening offers an appealing alternative, enabling the 
prediction of effective drug combinations in silico prior to testing a reduced pool of potential 
combinations in vitro. This method streamlines the drug development process, and when combined 
with patient-specific genomic profiling, can also enable personalized drug combination selection to 
potentially achieve resistance-proof responses in patients. 

In recent years, a variety of computational approaches have been developed to predict 
combination therapy responses for cancer drug screening [9,10]. Most of these methods primarily 
rely on drug structure characteristics and cancer-specific baseline genomic profiling to predict 
effective drug combinations, spurred by advancements in the high-throughput acquisition of these 
data types. For example, a high-dimensional tensor-based modeling strategy used similar data and 
achieved impressive accuracy (Overall R² ~ 0.8) in predicting response to combination therapies, 
validated experimentally [11]. This approach and others employ intricate neural network 
architectures that, while capable of producing high performing models, can be challenging to 
interpret, posing a barrier to the broader understanding of their underlying mechanisms. Tree-based 
machine learning models on the other hand, although simpler and sometimes less powerful, are 
generally considered interpretable depending on the type of data fed to them [12]. Notably, 
drug-protein interactions, which are intuitively central to the process of phenotype reversal, have been 
relatively underexplored in these computational approaches. One reason is a mismatch in data 
density: genomics and molecular fingerprinting generate thousands of features for 
each measurement, while drug target data has generally been sparse, with only a few annotated 
targets per drug. However, recent advances in technology to profile the interactions of clinical drugs 
with all the members of the kinome represent an unprecedented ability to measure drug-target 
information across ~500 proteins simultaneously in a quantitative manner [13,14]. The breadth, 
density, and ease of acquisition of this data, often measured at multiple dose points, is ideal for 
integration into machine learning models that can leverage diverse data types for drug response 
prediction. 

Specifically, recent advances in proteomics techniques have facilitated the large-scale 
characterization of drug-kinase interactions, providing valuable information on the extent to which 
the entire kinome is inhibited by specific drugs or drug combinations. A landmark paper in 2017 
used a mass spectrometry-based assay that used promiscuous kinase-binding compounds 
immobilized on beads to measure the binding competition between any given inhibitor and any 
given kinase (henceforth called the “kinobeads” assay) [15]. Using this assay, the kinome-wide 
binding profiles for ~230 clinical kinase inhibitors at eight doses each were elucidated using cancer 
cell lysates, forming the largest in-cell drug-target binding database publicly available at this time. 
The data generated from these assays allow interrogation of how clinical and investigational drugs 
interact with the entire kinome on an unprecedented scale. By analyzing the degree of inhibition of 
all kinases simultaneously for a given inhibitor, we can treat this as characterizing the degree of 
departure from the “baseline kinome state”, thus moving through drug-induced alteration of multiple 
kinase activities to a new “kinome inhibition state”. Given the degree to which modulation of the 
kinome alters cellular state and downstream behavior, these baseline kinome states and kinome 
inhibition states can be directly connected to various measured cellular phenotypes. We have 
recently demonstrated this idea by showing that kinome inhibition state is significantly predictive 
of cancer cell responses to kinase inhibitor monotherapies when integrated with cancer-specific 
information, such as baseline transcriptomics, using tree-based machine learning models [16]. 

In this work, we show that by combining the inhibition states of two kinase inhibitors, we can 
generate a hypothetical “combined” inhibition state for an untested inhibitor combination. In this 
manner, we can rationally use all combinatorial kinome inhibition states to sample all possible 
kinase target combinations, hypothetically including all pathway partners. By integrating these 
inhibition states with cancer-specific baseline transcriptomics, we demonstrate that the combined 
inhibition state can predict the sensitivity of cancer cell lines to inhibitor combination treatments 
from the NCI-ALMANAC dataset using interpretable machine learning models. We further validate 
these models experimentally by examining novel inhibitor combinations in a PDX-derived triple- 
negative breast cancer (TNBC) cell line. By focusing on dual-inhibitor drug-kinase interactions 
combined with cancer-specific baseline genomic profiling, we can enhance computational 
combination drug screening pipelines with combinatorial kinase targeting. Furthermore, this 
approach lays the foundation for the rational design and a priori prediction of combination kinase 
inhibitor treatments for patients with the potential to ultimately reduce single kinase inhibitor 
resistance acquisition by prior rational targeting of partner pathways and associated kinases. 


2. Results 


2.1. Creating a Set of Combined Kinome Inhibition States Representing Current and 
Potential Kinase Inhibitor Combination Therapies 


In this work, we have focused on a specific set of 200 kinase inhibitors characterized using the 
kinobeads assay [15]. These inhibitors were profiled in-cell for their interactions with ~500 kinases 
and kinase-interacting proteins, across eight doses. From this data, as described previously [16], 
we extracted monotherapy “kinome inhibition states”, denoting the degree to which they 
inhibit each kinase in the kinome at eight doses on a scale of 0-1 (0 is complete inhibition and 1 is 
no inhibition of a given kinase). We next tested different methods to approximate the kinome 
inhibition state of a kinase inhibitor combination. Intuitively, this can be thought of as simply 
superimposing two individual monotherapy inhibition states, but for the few cases where different 
inhibitors target the same kinase, we found ways to accurately reflect the resulting effect on the 
kinome. Here, we tested combining monotherapy kinome inhibition state vectors through addition, 
multiplication, and truncated multiplication (with kinase inhibition values >1 capped at 1). All three 
methods were compared for downstream model performance. 

Figure 1. Kinome Inhibition State Combination Modeling and Data Overview. (a) Schematic 
of modeling pipeline. (b) Heatmap showing the inhibition state of individual kinase inhibitors (rows 
1 and 2), and the hypothetical “combined” inhibition state for the two inhibitors (row 3). (c) Bar 
plot showing number of cell lines tested per cancer type in training data set. (d) Bar plot showing 
number of unique combinations tested per cell line for the breast cancer subset of the training data 
set. (e) Ridge plots showing cell viability (x-axis) variation for a random subset of different kinase 
inhibitor combinations (y-axis) in the NCI-ALMANAC data for breast cancer cell lines. Different 
breast cancer subtypes are represented with differing colors. 


After combining the individual inhibition states, we were left with a dataset describing all 
possible pairwise combinations of ~220 kinase inhibitors. These ~45,000 combinations represent 
the kinome inhibition states of existing clinical therapies (example), therapies currently in clinical 
trials (example), as well as potential therapies. Together, they interrogate a search space that 
includes nearly every known kinase on the phylogenetic tree (Fig S1). 


2.2. Connecting Inhibited Kinome States with Cancer Cell Line Combination Sensitivities 


Next, we linked the data set describing kinase inhibitor combinations to their cell sensitivity 
phenotypes in the large-scale ALMANAC drug combination screen. The ALMANAC screen 
contains cell sensitivity data for 53 kinase inhibitor combinations, over ~200 unique dose 
combinations for 45 cell lines across 9 cancer types. Additionally, previous high-throughput 
combination screens conducted in our lab in breast cancer offered data for 56 inhibitor combinations 
in four cell lines. Ideally, we would like exact matches between the dose at which kinome inhibition 
state is profiled and the dose at which cell sensitivity was measured. However, there are very few 
exact matches between the datasets. To overcome this, we found the nearest dose (6 exact matches, 
14 nearest matches differing by at most 1 µM) at which kinome inhibition was profiled for each 
cell sensitivity measurement and connected the two datasets using these dose matches. 
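
The nearest-dose matching can be pictured with the short sketch below. It is an illustration only: the function name, the absolute-difference distance on the µM scale, and the example dose grid are assumptions, and the doses shown are placeholders, not the doses profiled in the kinobeads data.

```python
def match_nearest_dose(measured_dose, profiled_doses, max_gap=1.0):
    """Return the profiled dose closest to a measured dose, or None.

    All doses are assumed to be in µM. A match is rejected when the
    closest profiled dose differs by more than max_gap (1 µM here).
    """
    nearest = min(profiled_doses, key=lambda dose: abs(dose - measured_dose))
    if abs(nearest - measured_dose) > max_gap:
        return None
    return nearest


# Placeholder dose grid for illustration only.
example_profiled_doses = [0.01, 0.1, 1.0, 10.0]
```

Matching on a logarithmic dose scale would be an alternative design choice; the absolute difference is used here only because the tolerance above is stated in µM.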

Figure 2. Feature Selection using an Elastic-net Regression Model against Cancer Cell Line 
Sensitivity. (a) Ridge plot showing the distribution of LASSO coefficient sizes as a metric for 
feature importance, for each feature type. (b) Horizontal bar plot showing kinases with the largest 
elastic-net coefficient values, colored by whether they are defined as “understudied” (Dark) or 
“well-characterized” (Light). 


Additionally, we added cell line specific information to the dataset to complement the drug- 
specific kinome inhibition states. The CCLE database contains baseline transcriptomics data for 
~1500 cancer cell lines, and almost all of the cell lines included in our data set were represented. 
Using this, we further added baseline gene expression into the dataset, now containing kinase 
inhibitor combinations, their inhibition state of the kinome, the cell line sensitivity to their treatment, 
as well as that cell line’s baseline gene expression. In this way, the dataset connects the kinome 
inhibition states of inhibitor combinations to their cell sensitivity phenotypes. 

The collected dataset represents a total of eight major cancer types, with the majority having ~7 
cell lines represented each, while breast cancer had the most representation (11 cell lines). To ensure 
that the machine learning model downstream could find cancer-specific linkages between the 
kinome and cell sensitivity, we split the dataset into eight individual cancer type datasets and 
conducted all modeling on each data split in parallel. 


2.3. Elastic-Net Feature Selection Reveals Kinome Inhibition States to be Most Informative 


In our collected dataset, kinome inhibition states and baseline gene expression together represent 
~20,000 variables or “features” that could affect the phenotype of cell sensitivity to kinase inhibitors. 
It is both computationally prohibitive and ineffective to build models using all available features, so 
we sought to filter the dataset down to only the 
most informative features. To accomplish this “feature selection”, we built our machine learning 
pipeline starting with an elastic-net regression [17] model built against the outcome of cell 
sensitivity. This generated coefficients for each feature, with the absolute value of each 
coefficient serving as a measure of the feature’s predictive value for the outcome. We ensured 
non-informative features were not included in modeling by only considering features with non-zero coefficients. We 
fit the model on the entire dataset to visualize a snapshot of the feature coefficients globally. This 
revealed overwhelmingly larger coefficients for kinome inhibition states compared to baseline gene 
expression (Fig 2a), thus indicating that kinome inhibition states were globally more informative 
for cell sensitivity prediction compared to baseline gene expression. 

For downstream model building, the data set was split into training and testing sets five times 
(five-fold cross validation). To keep the held-out test data from influencing the trained model (i.e., to 
prevent data leakage), the elastic-net model was fit on only the training data, and features were selected 
within each fold. Parameters for the elastic-net model and hyperparameters for the tested model 
types were also tuned this way. 
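
The fold-wise selection scheme can be sketched as follows. This is a self-contained toy version, not the glmnet-based pipeline used in this work: the coordinate-descent solver, the penalty settings `lam` and `alpha`, the absence of an intercept, and all function names are illustrative assumptions. The point it illustrates is that the elastic-net fit, and therefore the set of non-zero features, only ever sees the training portion of each fold.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_elastic_net(X, y, lam=0.1, alpha=0.5, n_iter=200):
    """Coordinate descent for
    (1/2n)||y - Xb||^2 + lam * (alpha * |b|_1 + (1 - alpha)/2 * |b|_2^2).
    No intercept is fit; columns are assumed to be roughly centered.
    """
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    residual = list(y)  # equals y - Xb for the current coefficients
    for _ in range(n_iter):
        for j in range(p):
            column = [row[j] for row in X]
            z = sum(v * v for v in column) / n
            if z == 0.0:
                continue
            rho = sum(column[i] * (residual[i] + column[i] * coef[j]) for i in range(n)) / n
            updated = soft_threshold(rho, lam * alpha) / (z + lam * (1.0 - alpha))
            delta = updated - coef[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= column[i] * delta
                coef[j] = updated
    return coef


def select_features_per_fold(X, y, n_folds=5, seed=0, **penalty):
    """Return, for each fold, the indices of features with non-zero
    coefficients when the model is fit on that fold's training rows only."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::n_folds] for k in range(n_folds)]
    selected = []
    for test_fold in folds:
        held_out = set(test_fold)
        train = [i for i in indices if i not in held_out]
        coef = fit_elastic_net([X[i] for i in train], [y[i] for i in train], **penalty)
        selected.append([j for j, c in enumerate(coef) if c != 0.0])
    return selected
```

In a full pipeline, the downstream model for each fold would then be trained on the selected columns of the training rows and evaluated on the held-out rows.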


2.4. Machine Learning Models Can Predict Cancer Cell Line Sensitivity to Combination 
Therapies by Integrating Kinome Inhibition States and Baseline Transcriptomics 


After data set preparation and feature selection, we built machine learning models that can predict 
cell sensitivity to kinase inhibitor combinations. For each cancer type, three machine learning model 
types were tested: random forest, boosted trees (XGBoost), and deep neural networks. XGBoost 
performed the best for all cancer types, with type-specific performance largely dependent on 
abundance of data in the training set (Fig 3b). The most abundant cancer type (breast) had the best 
performing model with an R² score of 0.93 (Fig 3b) while the lowest performing model was prostate 
cancer with R² = 0.73. Given our previous lab experience with breast cancer, we chose the breast 
cancer model for downstream experiments and validation. 


Additionally, since the best-performing model was tree-based gradient boosting, we were able 
to further analyze the model using its computed feature importance scores to find the most informative 
features in the data set. Similar to the feature selection output, 
we saw much higher feature importance scores overall for kinome inhibition states when compared 
to baseline gene expression, and several kinases implicated in breast cancer dysfunction had high 
importance scores, such as MAP2K1/2 and EGFR (Fig. 3c). 

Figure 3. Development of Models to Predict Cancer Cell Line Sensitivities to Kinase 
Inhibitor Combination Therapies from Kinome Inhibition States. (a) Model performance 
metrics (R-squared) for Random Forest (dots) and XGBoost (triangles). (b) Scatter Plot of 
predicted sensitivity values from the best-performing model vs actual sensitivity values. The red 
line indicates a smooth fit through the data points. (c) Horizontal bar plot showing model 
importance of individual kinase inhibition states by importance values. (d) Horizontal bar plot 
showing model importance of individual baseline gene expression by importance values. 


2.5. Experimental Validation of Model Predictions in a PDX-Derived Triple Negative 
Breast Cancer Cell Line was Successful. 


We demonstrated that machine learning models using the kinome inhibition states of combination 
therapies along with cell-specific baseline gene expression could robustly predict cell sensitivity in 
multiple cancer types. However, to see if these predictive models could extend to real-world 
experiments, we experimentally validated 35 kinase inhibitor combinations in a PDX-tumor derived 
cell line (Fig 4a). 

Figure 4. Experimental Validation of Model through a Trametinib Combination Screen in 
the WHIM12 Patient-Derived TNBC Cell Line. (a) Schematic showing experimental validation 
pipeline for the WHIM12 PDX-derived cell line. (b) Kinome phylogenetic map showing diversity 
of kinome targeted (red = inhibited by a validated kinase inhibitor combination). (c) Grid of scatter 
plots showing accuracy of top nine tested combinations. For all scatter plots, the dashed line 
indicates where perfect predictions would lie and the red line shows a linear fit through the data. 
Quantitative accuracy is represented by the R-squared score. (d) Scatter plot showing the global 
accuracy of model. (e) Grid of scatter plots showing accuracy of model predicted synergy scores 
compared to experimentally measured synergy scores across two metric types (ZIP, Bliss). (f) Grid 
of heatmap plots showing comparison of predicted vs experimentally measured sensitivity and 
synergy for the highly synergistic trametinib / omipalisib combination. 


High-throughput cell line drug screens have been widely documented to suffer from a lack of 
reproducibility and poor translation to more complex samples like patient tumors. We sought to test 
whether our model of cell sensitivity in breast cancer, trained on 11 well-characterized immortalized 
cell lines, could effectively predict cell sensitivity in a PDX (Patient-Derived Xenograft) derived 
cell line. We chose the WHIM12 PDX-derived cell line, which was generated from a highly chemo- 
resistant TNBC tumor [18]. Previous experiments in the lab had conducted a drug combination 
screen in the WHIM12 cell line, out of which 35 kinase inhibitors were tested in combination with 
trametinib. Complementary baseline gene expression data was also generated through RNAseq. 
Using these in-house data, we were able to input the unseen WHIM12 gene expression into the 
trained model and predict the cell sensitivity outcomes of the conducted drug combination screen. 
We achieved robust prediction accuracy (Global R² = 0.74 / RMSE = 0.14) in predicting exact cell 
viability in response to treatment with 35 kinase inhibitor combinations, across 64 dose 
combinations (Fig 4c, d). 
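
For reference, the two accuracy measures quoted here can be computed as in the sketch below. It assumes that R² denotes the coefficient of determination (one minus the ratio of residual to total sum of squares); the text does not state which definition was used, and the function names are illustrative.

```python
import math


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total.
    Undefined (division by zero) when all actual values are identical."""
    mean_actual = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_residual / ss_total


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the measurements."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))
```

Because cell viability is scaled between 0 and 1, an RMSE on that scale can be read directly as a typical error in fractional viability.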


2.6. Model Predictions Reveal Known Synergy in trametinib/omipalisib Combination 


The model predictions in the WHIM12 cell line were further interrogated for potential synergy. 
We generated synergy scores for all 35 combinations at each of the 64 dose points using the R 
package SynergyFinder [19] based on four different metrics: Zero-Interaction Potency [10] (ZIP), 
Bliss Independence [20], Highest Single-Agent (HSA), and Loewe Additivity [21]. Additionally, 
we generated similar synergy scores using the actual experimental data generated for validation as 
a comparison. We found a high degree of similarity (Global R² ~ 0.94 / RMSE ~ 0.5) between 
predicted and actual synergy, with trametinib + omipalisib as our most synergistic predicted 
combination, with a ZIP score of ~8 at certain dose combinations (Fig 4e, f). This is significant as 
the model predictions were in a TNBC PDX-derived line, and the trametinib/omipalisib 
combination represents the popular strategy of simultaneously targeting the MAPK and PI3K 
pathways [22]. 
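
As a simplified illustration of two of the synergy metrics named above, the sketch below computes the excess over the Bliss independence and Highest Single-Agent (HSA) expectations from fractional viabilities. It is not the SynergyFinder implementation, which works on full dose-response matrices; the function names and the 0-1 viability inputs are assumptions made for this example.

```python
def bliss_excess(viability_a, viability_b, viability_ab):
    """Observed minus expected inhibition under Bliss independence.

    Inputs are fractional viabilities (1 = fully viable, 0 = full cell death).
    Expected inhibition for independent drugs: Ea + Eb - Ea * Eb.
    Positive values suggest synergy, negative values antagonism.
    """
    inhibition_a = 1.0 - viability_a
    inhibition_b = 1.0 - viability_b
    expected = inhibition_a + inhibition_b - inhibition_a * inhibition_b
    observed = 1.0 - viability_ab
    return observed - expected


def hsa_excess(viability_a, viability_b, viability_ab):
    """Observed inhibition minus that of the more effective single agent."""
    expected = max(1.0 - viability_a, 1.0 - viability_b)
    observed = 1.0 - viability_ab
    return observed - expected
```

For example, two drugs that each leave 50% of cells viable are expected under Bliss independence to leave 25% viable together, so an observed 10% viability corresponds to an excess inhibition of 0.15.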


3. Methods 


Data Sources. The kinome profiling data set from the kinobeads assay was downloaded from the 
supplementary materials of Klaeger et al. 2017 [15]. For cancer cell line sensitivity to kinase 
inhibitor combinations, data was downloaded from (1) NCI-ALMANAC: cell sensitivity data was 
downloaded from the NCI wiki database (https://wiki.nci.nih.gov/display/NCIDTPdata/NCI- 
ALMANAC) and (2) Supplementary materials of previous lab combination screens published in 
Beville et al. 2019 [28] and Stuhlmiller et al. 2015 [29]. The CCLE gene expression set 
(“CCLE_expression.csv”) was downloaded from the DepMap portal 
(https://depmap.org/portal/download/all/) to create the set of cancer cell lines and their gene 
expression characteristics. In-house baseline gene expression data for the PDX-derived WHIM12 
line was downloaded from the GEO repository for the Zawistowski et al. paper [26] (GSE87424). 


Data Preprocessing. The scripts implementing these descriptions are all available through GitHub. 
Klaeger et al. Kinobead Kinase Inhibition Profiles: As previously described [16], we read the 
values from the supplemental data table into R and produced a filtered list of kinase and kinase 
interactor relative intensity values. We imputed missing values with the default “no interaction” 
value of 1 and truncated likely outlier values to the 99.99 percentile (3.43). 

Creating the Combination Inhibition State Data Set: To create a “combined” inhibition state of 
a given kinase inhibitor combination, we sought to superimpose the inhibition states of two 
individual inhibitors at specific doses. There were eight doses measured for each individual inhibitor, 
thus there were 64 possible dose pairs for each combination. We took the monotherapy kinome 
inhibition states from the Klaeger et al. set and computed a “combined” inhibition state for each 
kinase, based on three different combination schemes: 

1. Simple Multiplicative: This scheme follows the product rule for the joint probability of two 
independent events (A and B in Eq. 1). Since the default “no interaction” inhibition value is 1, for 
kinases that are not targeted by both inhibitors simultaneously, the “combined” inhibition state (C') 
value is simply the value from whichever monotherapy targets that kinase. 

2. Truncated Multiplicative: A minority of measured kinase inhibition states (~1%) have values > 
1 in the Klaeger et al. dataset, a possible artifact from the mass spectrometry measuring process. 
To avoid inflating those values, all >1 values were truncated at 1 and simple multiplication was 
performed as described above (Eq. 2). 

3. Addition: All kinase inhibition states were inverted into “Percent Inhibition” values (A' and B'), 
where 0 denotes no inhibition and 100 denotes complete inhibition. Then, when two inhibition 
states were combined, they were added together and truncated at a max value of 100 (Eq. 3). 


Eq. 1. C’ = A * B
Eq. 2. C’ = min(1, A) * min(1, B)
Eq. 3. C’ = min(100, A’ + B’)


All three methods were tested in downstream modeling, with only minor variation in the results.
Truncated multiplied vectors were slightly more predictive (R² greater by ~0.01), so we used that
scheme for all downstream modeling. In this way, we were able to compute hypothetical “combined”
inhibition states for all possible combinations of ~220 inhibitors, altogether comprising ~2,000,000
combined inhibition states.
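
The three combination schemes in Eqs. 1-3 can be illustrated with a minimal Python sketch (the published scripts are in R; this is not the authors' code). The function names, the dictionary representation of an inhibition state, and the example values are hypothetical. The conversion to percent inhibition as 100 * (1 - value), clipped to the range 0-100, is an assumption, since the text only states that values were inverted.

```python
def simple_multiplicative(a, b):
    """Eq. 1: C' = A * B, where 1 means no interaction."""
    return a * b


def truncated_multiplicative(a, b):
    """Eq. 2: values above 1 are truncated at 1 before multiplying."""
    return min(1.0, a) * min(1.0, b)


def percent_inhibition(value):
    """Assumed inversion: 100 * (1 - value), clipped to [0, 100]."""
    return max(0.0, min(100.0, 100.0 * (1.0 - value)))


def additive(a, b):
    """Eq. 3: C' = min(100, A' + B') on the percent-inhibition scale."""
    return min(100.0, percent_inhibition(a) + percent_inhibition(b))


def combine_states(state_a, state_b, scheme=truncated_multiplicative):
    """Combine two inhibition states given as {kinase: value} dictionaries.

    Kinases absent from a state get the default "no interaction" value of 1.
    """
    kinases = sorted(set(state_a) | set(state_b))
    return {k: scheme(state_a.get(k, 1.0), state_b.get(k, 1.0)) for k in kinases}


# Hypothetical inhibition states for two inhibitors at one dose each.
drug_a = {"MAP2K1": 0.2, "PIK3CA": 1.0, "EGFR": 1.1}
drug_b = {"MAP2K1": 1.0, "PIK3CA": 0.5, "EGFR": 0.9}
combined = combine_states(drug_a, drug_b)
```

Under the truncated scheme, a kinase hit by only one inhibitor keeps that inhibitor's monotherapy value, and a value above 1 (here the hypothetical EGFR value of 1.1) no longer inflates the product.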

Dataset of Cancer Cell Line Sensitivity to Kinase Inhibitor Combinations: The cell sensitivity
datasets from NCI-ALMANAC and previous lab publications were filtered to contain only kinase
inhibitor small molecules, then summarized over replicates and converted to cell viability (1 = fully
viable cells and 0 = complete cell death). Relevant cancer types were annotated, and individual
cancer type datasets were subset for downstream cancer type-specific modeling.

Matching of Kinase Inhibitors between Inhibition State Dataset and Cell Line Sensitivity
Dataset: The drug names from each dataset were read into R, and the package webchem [30] was
used to retrieve PubChem compound IDs (CIDs). The two sets of drug names were then matched
based on these reference IDs, with a total of ~100 matches between the two sets.

Baseline Gene Expression from CCLE: Data was preprocessed as described before [31] from 
the “CCLE_expression.csv” file. Cell line names were matched manually between CCLE and the 
NCI naming scheme. All cell lines represented in NCI-ALMANAC had a match in the CCLE 
database. 

STRING: The STRING database [32] was processed as described previously [31] to annotate
kinases and kinase interacting genes. 

Modeling Techniques: To assess our models, we used a random 5-fold cross-validation strategy.
We implemented elastic-net regression using the glmnet engine [33] for the feature selection
scheme [17]. We compared the performance of three model types using this strategy, including
random forest using the ranger engine [34] and gradient boosting using the XGBoost engine [35].
Model performance was assessed by the R-squared value between predicted and actual outcome within
the cross-validation scheme. For each model type and for the feature selection model, we tuned sets
of 20 hyperparameters to find the best possible performer as follows: (a) Elastic-net: Penalty
(0 - 0.1), Regularization (0.1 - 1); (b) Random Forest: Trees (100 - 2000); (c) XGBoost: Trees
(100 - 1000), Tree Depth (4 - 30). After final model selection, we fit the model on the entire
dataset and then made predictions on the experimental validation data.
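
As an illustration of the evaluation strategy, the following Python sketch scores a model by R-squared within a random 5-fold cross-validation. It is not the authors' R pipeline: the helper names, the toy straight-line learner, and the synthetic data are hypothetical, and R-squared is computed here as the coefficient of determination, which is an assumption about the metric definition used.

```python
import random


def r_squared(actual, predicted):
    """Coefficient of determination between actual and predicted values."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot


def k_fold_indices(n_samples, k=5, seed=0):
    """Randomly partition sample indices into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_r2(features, outcomes, fit, k=5, seed=0):
    """Mean held-out R-squared; fit(train_x, train_y) must return a predictor."""
    scores = []
    for held_out in k_fold_indices(len(outcomes), k, seed):
        held_out_set = set(held_out)
        train = [i for i in range(len(outcomes)) if i not in held_out_set]
        model = fit([features[i] for i in train], [outcomes[i] for i in train])
        predictions = [model(features[i]) for i in held_out]
        scores.append(r_squared([outcomes[i] for i in held_out], predictions))
    return sum(scores) / len(scores)


def fit_line(xs, ys):
    """Toy learner (ordinary least squares on one feature), for illustration only."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return lambda x: mean_y + slope * (x - mean_x)


# Synthetic, noise-free data: the held-out R-squared should be essentially 1.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
mean_r2 = cross_validated_r2(xs, ys, fit_line)
```

In practice, the `fit` argument would wrap whichever learner is being compared, so that every model type is scored on the same folds.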

All of the data and code written to support this paper are available through GitHub
(https://github.com/gomezlab/kinotype_combination_prediction).

Experimental Validation. 6x6 dose combination screens were performed in the WHIM12 cell
line as described in Beville et al. 2019 [28]. Briefly, cells were seeded in 384-well plates and dosed
with drug after 24 h. The screening library was tested for growth inhibition alone or in combination
with trametinib across 6 doses: 10 nmol/L, 100 nmol/L, 300 nmol/L, 1 µmol/L, 3 µmol/L, and 10
µmol/L. DMSO (0.1%) was included as the control for growth inhibition on each plate. Plates were
incubated at 37°C for 96 hours and lysed using CellTiter-Glo Reagent (Promega, catalog no.
G7570). Luminescence was measured using a PHERAstar FS instrument, and growth inhibition was
calculated relative to DMSO-treated wells.


4. Discussion 


Kinase inhibitors are one of the fastest growing drug classes for cancer therapy, with ~62 approved
by the FDA in total against neoplasms [2]. With 500 potential druggable targets, there is significant
interest in streamlining the kinase inhibitor screening process. We have previously introduced
[16,23,24] the idea that the full spectrum of a given inhibitor’s effect on the kinome, as measured by
recent advances in kinobead-competition/MS technology [15], can be represented as a “kinome
inhibition state”, i.e., a vector representing the effect of a given inhibitor on the kinome as a whole.

In this work, we have extended this idea to represent the kinome inhibition state of a combination 
of inhibitors, using a multiplicative probability model to “combine” the inhibition states of two given 
kinase inhibitors. By generating these “combined” inhibition states, we can vastly expand the search 
space targeted by inhibitor monotherapies, sampling all possible combinations of currently available 
therapies. To accomplish this, we used publicly available drug-kinome interaction data to generate 
snapshots of the combined effect of a combination therapy on the protein kinome. We then linked 
these kinome inhibition states of inhibitor combinations to cancer cell sensitivity phenotypes to 
combination treatment, creating a framework for predicting the efficacy of combination therapies in 
different cancer types. 

We fit tree-based machine learning models on this linked data set to robustly predict
cancer cell line sensitivity and synergy for untested kinase inhibitor combination therapies and
validated those predictions in complex patient-derived samples. Gradient-boosted tree models were
highly accurate across cancer types (R² 0.75-0.93), comparable to two recent neural-network-driven
attempts to predict cell line response to drug combinations [9,11]. We chose to validate our model
predictions in the PDX-derived WHIM12 line, reasoning that PDX-derived cell lines retain many
of the molecular and genetic features of the original xenografted tumors. We were able to show that
the models performed robustly on novel gene expression data (R² ~0.74), demonstrating their ability
to extend from well-characterized cell line data to complex, clinically adjacent samples.

One of the strengths of tree-based models is that they are considered to be interpretable through
feature importance computation [12,25]. Using this, we were able to investigate the “black box” and
query which specific kinase inhibition states and baseline genes were most predictive of cell
sensitivity. We found that for the breast cancer model, the inhibition of the kinases MAP2K1/2 was
the most informative by far. This is intuitive considering that the most abundant kinase inhibitor in
the dataset is the allosteric MEK inhibitor trametinib, but it must be noted that MEK inhibition is
only ever one half of the kinome targeting in a combination. There has been increasing clinical
interest recently in targeting the PI3K and MAPK pathways [22], and our lab has shown before that
MEK1/2 inhibition in TNBC by trametinib induces widespread transcriptional adaptation, and that
there is potential for clinical efficacy in complementary kinome targeting with trametinib [26].
Because our model’s sensitivity predictions can also be used to predict synergy, we could rank
combinations by predicted synergy. Our top synergy prediction for breast cancer according to the
ZIP metric was trametinib and omipalisib, which we were able to validate experimentally in the
WHIM12 line. This indicates that from the breast cancer screening data, the model was able to learn
that targeting the complementary PI3K and MAPK pathways is effective and synergistic in TNBC.

Interestingly, the predicted high-synergy combination of trametinib/omipalisib was recently tested in
phase I clinical trials for advanced solid tumors but failed due to patient intolerability [27]. This
highlights some limitations of our modeling approach. Ideally, kinome inhibition state would be one
of many different drug modalities included for response prediction, and we plan to further expand
these models in the future by considering toxicity, drug structure, and cancer-describing multi-omic
data types not limited to baseline gene expression. Additionally, in this proof-of-concept study we
utilized multiplicative probability models to generate the “combined” inhibition state of two
inhibitors on the kinome, assuming that the inhibition of a given kinase is independent
of that of other kinases. We know that kinases function physiologically as part of complex
signaling networks, and their inhibition may have downstream effects on other kinases and signaling
pathways. To address this limitation, future models will incorporate more biologically
representative schemes to hypothesize combined kinome inhibition states.

In summary, through this work we demonstrate the development of a framework for predicting
the efficacy of combination therapies in different cancer types using just kinome-drug interactions
and baseline gene expression. We generated the combined “kinome inhibition state” and linked these
states to cancer cell sensitivity phenotypes. First, we were able to show that a given combination
therapy’s cancer-agnostic interaction with the kinome was far more informative than baseline
genomics in predicting downstream response. This is fundamentally intuitive, as drug-protein
interactions are the primary means of drug effect on physiology, but this type of data is still
underutilized in computational screening approaches. We then used machine learning models to
predict cell line sensitivity and synergy for untested kinase inhibitor combination therapies and
validated those predictions experimentally in complex patient-derived samples.


Abstract


Assembling an “integrated structural map of the human cell”¹ at atomic resolution will require
a complete set of all human protein structures available for interaction with other biomolecules
- the human protein structure targetome - and a pipeline of automated tools that allow
quantitative analysis of millions of protein-ligand interactions. Toward this goal, we here describe
the creation of a curated database of experimentally determined human protein structures.
Starting with the sequences of 20,422 human proteins, we selected the most representative
structure for each protein (if available) from the Protein Data Bank (PDB), ranking structures
by coverage of sequence by structure, depth (the difference between the final and initial residue
number of each chain), resolution, and experimental method used to determine the structure.
To enable expansion into an entire human targetome, we docked small molecule ligands to our
curated set of protein structures. Using design constraints derived from comparing structure
assembly and ligand docking results obtained with challenging protein examples, we here
propose to combine this curated database of experimental structures with AlphaFold predictions²
and multi-domain assembly using DEMO2³ in the future. To demonstrate the utility of our
curated database in identification of the human protein structure targetome, we used docking
with AutoDock Vina⁴ and created tools for automated analysis of affinity and binding site
locations of the thousands of protein-ligand prediction results. The resulting human targetome,
which can be updated and expanded with an evolving curated database and increasing
numbers of ligands, is a valuable addition to the growing toolkit of structural bioinformatics.


Keywords: ligand binding; reverse molecular docking; high-performance computing 


© 2023 The Authors. Open Access chapter published by World Scientific Publishing Company and 
distributed under the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC) 
4.0 License. 


1. Introduction


The structures of proteins determine their ability to interact with other biomolecules, which is
often at the heart of cellular functions and dysfunctions. Massive structural proteomics efforts
have made large numbers of protein structures available in the Protein Data Bank. While the
coverage still falls short of completeness for any single organism, including human and other
model organisms, let alone non-model organisms, the recent advent of molecular modeling
approaches that in some cases rival experimental structure determination in accuracy² now
allows us to start imagining complete datasets of the entire structural proteome of an organism.
Such datasets would allow us to start looking at the effects of natural and chemically
synthesized small molecules in the context of all possible interactions. The availability of data
and computing resources as well as development of new computational approaches are
revolutionizing the field of drug discovery.⁶ It is becoming increasingly clear that the traditional view
of one drug-one protein target is too reductionist: many successful drugs have multiple targets
(for example, the popular anti-diabetic drug metformin), and many metabolites interact not only
with the enzymes that use them to carry out chemical reactions but often with thousands
of other proteins.⁷ Thus, target discovery is becoming increasingly important for drug
discovery as well, and reverse docking (i.e., binding of a given ligand to many proteins, as opposed
to docking many ligands to one protein target) plays a major role in this field.⁸ Looking at the
entire set of human proteins that a ligand can potentially interact with - the human targetome -
would allow us to answer fundamental questions about the functioning of cells while
also improving drug discovery, drug repurposing and predictions of drug targets and toxicity.
Finally, we may begin looking at complex mixtures of ligands with biological efficacy, such as
natural extracts with positive health effects like lemon juice⁹ and environmental pollutants
such as asphalt,¹⁰ comprised of thousands of individual compounds.¹¹

Currently, docking and even reverse docking are carried out largely with limited subsets of
protein structures.¹²,¹³ To enable future systematic analysis of any biomolecular ligand with
an organism’s complete set of proteins, we describe an approach to create a database that
contains a single, optimal representative structure for each human protein. Our initial
strategy is centered around devising a biologically pertinent methodology to rank experimentally
derived protein structures, as outlined in Figure 1a. We use the UniProt database¹⁴ as
our reference for all human protein sequences and retrieve the list of structure files from the
Protein Data Bank. To select the most representative structure, we adopted three key parameters
for evaluation: coverage, depth, and resolution of the structures. “Coverage” refers to
the count of residues in the protein’s structure, indicating the structure’s completeness. We
prioritized this parameter due to its importance in understanding the overall integrity of a
protein. Nevertheless, we encountered situations where a protein’s structure, despite having
less coverage, offered more meaningful insights due to its residue information being spread
over a larger range of amino acids. To account for this, we introduced a novel metric, “depth”,
which calculates the difference between the maximum and minimum residue numbers. After
finally ranking by resolution, we obtained a list of 7,606 unique human protein structure files,
available on our GitHub page.

In the long term, we want to create a complete database to predict where and with what
affinity different ligands bind to the human targetome. This will require automated tools to
analyze the results obtained from docking ligands to human protein structures. It will also
require supplementing experimentally determined structures with predicted structures. We
here outline such methods and highlight design considerations using comparisons of known and
predicted structures in general, and a specific challenging protein example, the insulin receptor
(IR), in the context of structure assembly and ligand docking results. Based on this analysis,
we here propose a pipeline that incorporates experimental structures, AlphaFold predictions,²
multi-domain assembly using DEMO2,³ docking with AutoDock Vina,⁴ and automated analysis
of affinity and binding site location. The automated analysis uses center of mass comparisons
as well as silhouette score clustering optimization of predicted ligand volume overlap to classify
binding pocket numbers and locations for a given protein-ligand pair, and across many proteins
and many ligands. Our targetome-oriented, synergistic pipeline will augment protein structure and
ligand interaction prediction practices. The current stage of implementation of this pipeline
comprises the curated database of experimentally determined human protein structures, as well as
the code used to create the database and to analyze the docking results, all of which are
available online.


2. Materials and data sources 


An initial naive download, sourced from a spreadsheet listing experimental structures, ignored
specific chains and automatically chose the first entry in lists of multiple PDB codes for a given
protein. This led to over 10% of the downloads being duplicates of the same structures. In addition,
these files would often have multiple models or chains, which either crashed the pre-processing
code due to inappropriate bounding box sizes or yielded huge search spaces that crashed the
docking runs. The careful revision of the table (described in the following section) addressed
most of these cases. Table 1 reflects the impact of these revisions, comparing the results of
docking the ligand kaempferol against the full suite of downloaded structures before and after
revision. Out-of-memory errors and very large positive “overflow” affinity outputs indicate the
two modes of run failure described above.


Table 1: Comparison of ligand kaempferol docking results from the original naive scrape and
after table revision with specified chains, following the protocol shown in Figure 1a.


Statistic              | Original dataset | Improved dataset
PDBs                   | 6865             | 7529
Out of memory errors   | 288              | 0
Overflow affinities    | 399              | 244
Avg bounding box size  | 557279           | 212550


3. Methods
3.1. Database Creation 


An overview of the database creation is shown in Figure 1a. First, we downloaded a comprehensive
database comprising all 20,422 human protein sequences from the UniProt database.¹⁴

In the current implementation, we retained only those UniProt IDs with at least one experimental
structure associated with them and a file deposited in the Protein Data Bank (PDB).⁵
This filtering criterion excluded 12,606 proteins, leaving 7,816 unique UniProt IDs in this subset,
many of which were associated with multiple PDB files. To select the best representative
structure, we defined several ranking criteria. Structures sometimes miss portions of the sequence,
even if they were present during crystallization, often due to flexibility. This can occur in
loop regions or at the ends. Often, specific domains have been chosen to represent a portion
of the sequence. Because the structures of missing loop regions are typically ill defined, there
is a benefit in having a larger stretch of the sequence covered, even if the total coverage is
reduced by these missing loop regions. We wanted measures that capture both scenarios.
Coverage refers to the total number of residues of a sequence that are associated with xyz
coordinates in a structure, while depth refers to the difference between the beginning and end
of the structure, regardless of how many residues are missing in between. To compile the
information required for this ranking, we designed a web scraper to extract content from the
UniProt database.¹⁴ For each PDB file corresponding to a UniProt ID, the scraper retrieved the
resolution, the experimental method used (electron microscopy, X-ray crystallography, or NMR),
and the chains of each PDB file. The latter was essential, as a single PDB file can encapsulate
multiple proteins. Each DataFrame encompassed specific information for
each protein structure, including:


(1) PDB ID 
(2) Resolution 

(3) Chains and their associated locations 

(4) Experimental method used for structure determination 

(5) Whether alpha carbons were the only present atom in the PDB file 


While resolution and chain information was sourced directly from the UniProt database, coverage
and depth information for each PDB file necessitated the scraping and local downloading
of all PDB structures related to our 7,816 unique proteins from the RCSB PDB database.⁵ A total
of 210 UniProt IDs lacked any PDB-formatted structure within the RCSB PDB⁵ database,
thereby reducing our working dataset to 7,606 unique proteins. Computing coverage involved
iterating through the PDB file and enumerating the unique residues for each chain corresponding
to the UniProt ID. Meanwhile, the depth metric was derived by calculating the
difference between the final residue number and initial residue number of each chain within
the associated PDB file. For example, if a PDB file started at residue 42 and ended at residue
200, the depth would be 158. In instances where multiple experimental methods for structure
determination were utilized, we excluded NMR structures for a given UniProt ID because in
protein NMR, there is no parameter identical to resolution,¹⁵ complicating comparison with
X-ray and cryo-EM structures. Ranking involved the following steps:

(1) Organize the DataFrame in a hierarchical manner based on the coverage, depth, and
resolution of each PDB file.

(2) Purge structures that consist solely of alpha carbons, provided that other structures are
present.

(3) Apply the following decision-making rules iteratively until the top four structures
remain unchanged:


(a) If the coverage difference between a higher-ranked PDB file and a lower-ranked one falls
within a ±20 amino acid range, assess the depth of the structures and adjust the
ranking accordingly, favoring the structure with greater depth. This allows structures
with missing residues in loop regions to be ranked highly.

(b) In the case where the resolution of a higher-ranked PDB file exceeds 4 Å, rearrange the rows
to rank the structures according to their resolutions in descending order. This rule
balances coverage and resolution.


Upon securing a ranked list of PDB files for each UniProt ID, we extracted the highest-ranked
PDB file for each respective UniProt ID and its associated chain/location information.
For every top-ranked PDB structure, all missing residues were obtained using the PDBParser
class from the Biopython library.¹⁶ Two UniProt IDs had missing chain information
and were subsequently excluded from our dataset, leaving us with 7,604 unique proteins as
visualized in Figure 1a.
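
The coverage and depth metrics, and the depth-based adjustment in rule (a), can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the `Candidate` class, the `rank_candidates` function, and the placeholder PDB IDs are hypothetical; the initial ordering by coverage, then depth, then resolution (lower values first) is one reading of step (1); and the alpha-carbon filter, rule (b), and the "top four unchanged" stopping criterion are omitted.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    pdb_id: str
    residue_numbers: list  # residue numbers that have coordinates in the chain
    resolution: float      # in angstroms

    @property
    def coverage(self):
        """Number of unique residues with coordinates."""
        return len(set(self.residue_numbers))

    @property
    def depth(self):
        """Difference between the final and initial residue numbers."""
        return max(self.residue_numbers) - min(self.residue_numbers)


def rank_candidates(candidates, coverage_window=20):
    """Sort hierarchically, then let depth decide between near-equal coverages."""
    ranked = sorted(candidates, key=lambda c: (-c.coverage, -c.depth, c.resolution))
    swapped = True
    while swapped:  # repeat adjacent comparisons until the order is stable
        swapped = False
        for i in range(len(ranked) - 1):
            upper, lower = ranked[i], ranked[i + 1]
            close = abs(upper.coverage - lower.coverage) <= coverage_window
            if close and lower.depth > upper.depth:
                ranked[i], ranked[i + 1] = lower, upper
                swapped = True
    return ranked


# Placeholder entries: "AAAA" spans residues 42-200 but lacks a loop (100-120).
gapped = Candidate("AAAA", [r for r in range(42, 201) if not 100 <= r <= 120], 2.5)
compact = Candidate("BBBB", list(range(1, 151)), 2.0)
fragment = Candidate("CCCC", list(range(10, 60)), 1.5)
order = [c.pdb_id for c in rank_candidates([compact, fragment, gapped])]
```

Here the gapped structure has lower coverage (138 versus 150 residues) but greater depth (158 versus 149), so it moves ahead of the compact one, which is the behavior rule (a) is meant to produce for structures with missing loop regions.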

To obtain the AlphaFold complement of the experimentally known structures, we leveraged
AlphaFold’s API² to extract all associated AlphaFold models corresponding to the 7,604
UniProt IDs in our curated dataset. Using our top-ranked PDB file and the list of missing
residue numbers for a specific UniProt ID, we recorded AlphaFold’s predicted confidence
score for every residue, irrespective of its status (missing or present) in the
highest-ranked PDB structure. We then computed the average AlphaFold confidence score
for both missing and present residues in the top-ranked PDB structure for each UniProt ID,
as shown in Figure 1a.
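
The averaging step can be illustrated with a short Python sketch. The function name, the dictionary input format, and the example scores are hypothetical; it assumes that per-residue confidence scores (pLDDT, on a 0-100 scale) have already been extracted from the AlphaFold model, and it does not show how the models are retrieved.

```python
def mean_confidence_by_status(confidence, missing_residues):
    """Average per-residue confidence for present and for missing residues.

    confidence: {residue_number: score} taken from the AlphaFold model.
    missing_residues: residue numbers absent from the top-ranked PDB structure.
    Returns (mean_present, mean_missing); a mean is None if its group is empty.
    """
    missing = set(missing_residues)
    present_scores = [s for r, s in confidence.items() if r not in missing]
    missing_scores = [s for r, s in confidence.items() if r in missing]

    def mean(values):
        return sum(values) / len(values) if values else None

    return mean(present_scores), mean(missing_scores)


# Hypothetical scores: residues 3 and 4 are missing from the experimental structure.
scores = {1: 90.0, 2: 80.0, 3: 40.0, 4: 30.0}
mean_present, mean_missing = mean_confidence_by_status(scores, [3, 4])
```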


3.2. Multidomain Structure Prediction With DEMO2


A protein structure dataset based only on experimental structures is limited by the availability
of structural information for some parts of the sequence. Towards the aim of a complete
human protein structure dataset, we will need to combine experimental data available for
different parts of the sequence and/or integrate predictions of the missing parts. We evaluated
the feasibility of using protein-protein docking to combine structural information from
different sources into a complete model for a given UniProt sequence, using the DEMO2
software.³ Neighboring domains were sequentially submitted to DEMO2 as pairwise structure
files. For instance, in the case of the insulin receptor (IR), described in the Results, the L1
and CR domains were initially introduced into DEMO2, followed by the CR and
L2 domains. The output generated from both inputs was then imported into PyMOL, where
the structures were aligned based on the “common” domain - in this case, the CR domain.
This methodology was pursued iteratively until all desired domains were incorporated into
the aligned structure.

Fig. 1: Assembly and Composition of the Dataset. (b) Sequential steps undertaken to derive the
final dataset.


3.3. Analysis of Small Molecule Docking Positions 
3.3.1. Prediction of Small Molecule Ligand Binding Sites with AutoDock Vina 


To identify putative ligand docking positions and quantify their relations to highly dense
protein pocket regions, we utilized ligand-protein docking coordinates obtained from AutoDock
Vina.⁴ The table of structures was parsed for PDB code and specific chains. The PDB code
was used to scrape the structure file from rcsb.org. The chain was subsequently used to excise
the section of the PDB file to use in the docking. To coordinate large-scale runs, individual
AutoDock Vina scripts were automatically constructed, which employed PyMOL to determine the
center of mass and bounding box for each protein, with these values stored in a configuration
file. The reduce and prepare scripts were run on protein and ligand PDB files before the docking
run in the pipeline. These were sourced from the ADFR Suite of tools, although an updated, more
robust reduce script was later sourced from another repository
(https://github.com/rlabduke/reduce).¹⁷
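
To make the configuration step concrete, the following Python sketch computes a center of mass and bounding box from atom coordinates and formats them using AutoDock Vina's `center_x/y/z` and `size_x/y/z` configuration options. It is a hedged illustration only: the pipeline described above used PyMOL for these calculations, and the function names, file names, `padding` parameter, and example atoms here are hypothetical.

```python
def center_of_mass(atoms):
    """Mass-weighted mean position; atoms is a list of (mass, x, y, z) tuples."""
    total_mass = sum(atom[0] for atom in atoms)
    return tuple(
        sum(atom[0] * atom[axis] for atom in atoms) / total_mass for axis in (1, 2, 3)
    )


def bounding_box(atoms, padding=0.0):
    """Extent of the coordinates along x, y, z, plus optional padding per side."""
    return tuple(
        max(atom[axis] for atom in atoms) - min(atom[axis] for atom in atoms) + 2 * padding
        for axis in (1, 2, 3)
    )


def vina_config(receptor, ligand, atoms, padding=0.0):
    """Return the text of a Vina configuration file for a whole-protein search box."""
    center = center_of_mass(atoms)
    size = bounding_box(atoms, padding)
    lines = [f"receptor = {receptor}", f"ligand = {ligand}"]
    lines += [f"center_{axis} = {value:.3f}" for axis, value in zip("xyz", center)]
    lines += [f"size_{axis} = {value:.3f}" for axis, value in zip("xyz", size)]
    return "\n".join(lines) + "\n"


# Two hypothetical atoms of equal mass; file names are placeholders.
example_atoms = [(12.0, 0.0, 0.0, 0.0), (12.0, 10.0, 4.0, 2.0)]
config_text = vina_config("protein.pdbqt", "ligand.pdbqt", example_atoms)
```

Because the box is centered on the center of mass rather than the geometric center, a strongly asymmetric protein may need extra padding for the box to enclose every atom. Very large boxes also correspond to the huge search spaces that caused the failed runs described in Section 2.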

The AutoDock Vina code was run in batch mode using job array submissions to the
SLURM scheduler on Arizona State University’s Agave and Sol clusters.¹⁸ Most jobs were
completed using a single CPU and 4 GB of RAM. Figure 2 presents a logarithmic plot of
runtimes (in seconds) versus ligand size (in atoms). Mean runtimes were strongly
correlated (correlation coefficient 0.746) with the number of atoms. As ligand size increases,
the greater variation in runtime may be attributable to the number of flexible bonds or the total
volume. In contrast, protein size in atoms and mean runtimes were uncorrelated. Cumulative
runtime for a ligand across 7,527 proteins could range from hundreds to thousands of hours, but
distribution across the 18,000 available cores on Sol dramatically reduced wall time. Outputs were
stored in a directory structure with ligands at the top tier, each having several thousand protein
directories containing affinity and output structure files for the top-tier ligand.

Fig. 2: Log plot of mean runtimes (in seconds) across 7,527 proteins versus ligand size (total
atoms). While there was large variation in runtimes, indicated by error bars, the means were
strongly correlated with ligand size.


3.3.2. Point Cloud Clustering & Visualizations Created Using Delaunay Triangulation 


We analyzed the overlap of ligand docking positions using collections of three-dimensional
point clouds that we rendered as surfaces by applying Delaunay triangulation. Delaunay
triangulation is a useful method for plotting an arbitrary collection of coordinates as volumetric
bodies. To further examine the spatial overlap of ligand-protein docking models for individual
ligand-protein pairs, as well as the spatial overlap of docking positions for potentially
competing ligands and their respective proteins, we deployed K-means clustering optimized using
silhouette analysis. Silhouette analysis evaluates the density and separation between clusters,
calculating a score by averaging the silhouette coefficient for each sample, which is computed as
the difference between the mean nearest-cluster distance and the mean intra-cluster distance
for each sample, normalized by the larger of the two values. The scores range between -1 and +1,
where +1 indicates high separation of clusters and -1 indicates that the coordinates may have
been assigned to the wrong cluster. By taking the highest-scoring configuration of clusters,
we grouped ligand docking models into “locations” or “pockets.”
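
The silhouette score described above can be written out in a few lines of Python. This sketch is for illustration and is not the authors' analysis code: it scores candidate cluster assignments that are supplied directly, rather than running K-means, and the function name and example coordinates are hypothetical. Assigning a score of 0 to a point that is alone in its cluster is a common convention and an assumption here.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient (b - a) / max(a, b) over all points.

    a: mean distance to the other points in the same cluster.
    b: mean distance to the points of the nearest other cluster.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette analysis needs at least two clusters")

    coefficients = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:  # convention: a singleton cluster scores 0
            coefficients.append(0.0)
            continue
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        largest = max(a, b)
        coefficients.append((b - a) / largest if largest > 0 else 0.0)
    return sum(coefficients) / len(coefficients)


# Hypothetical ligand centers forming two well-separated groups.
centers = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (10.0, 0.0, 0.0), (10.0, 1.0, 0.0)]
candidate_labelings = [[0, 0, 1, 1], [0, 1, 0, 1]]
best_labels = max(candidate_labelings, key=lambda lab: silhouette_score(centers, lab))
```

In the workflow described above, the candidate labelings would come from K-means runs with different numbers of clusters, and the highest-scoring configuration would define the pockets.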

As a metric for percent overlap of the volumetric surfaces rendered from the docking 
coordinates, we used Equation 1, where m is the number of models contained in an AutoDock 
Vina output file for a ligand-protein pair and k is the optimal number of clusters determined 
by the K-means algorithm. Fewer clusters result in a greater percent overlap, and in cases 
where the ratio of clusters to models is 1, the percent overlap is 0. 


$$\text{Percent Overlap}(m, k) = \left(1 - \frac{k}{m}\right) \times 100 \tag{1}$$
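
As a small worked example of the overlap metric, the following Python function assumes that the metric depends on the ratio of clusters to models, k/m, as the description above implies (an overlap of 0 when that ratio is 1). The function name and the example counts are hypothetical.

```python
def percent_overlap(m, k):
    """Percent overlap for m docking models grouped into k clusters."""
    if not 1 <= k <= m:
        raise ValueError("expected 1 <= k <= m")
    return (1 - k / m) * 100


# Nine models in three clusters overlap more than nine models in nine clusters.
overlap_three_clusters = percent_overlap(9, 3)
overlap_nine_clusters = percent_overlap(9, 9)
```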


3.3.3. Center of Mass 


PyMOL routines were employed for the center of mass calculations, which were used to prepare 
AutoDock Vina configuration scripts and in the post-processing of ligands for analysis. 


4. Results and Discussion 
4.1. Human Protein Structure Database Creation 


There are 20,422 unique human protein sequences in UniProt,¹⁴ out of which 7,816 have at
least one PDB file associated with them. However, this number overestimates the availability of
structural information because often only a single domain of a given human protein has been
crystallized. A protein structure dataset based only on experimental structures is therefore
limited by the availability of structural information for some parts of the sequence. The scale
of this problem is highlighted in Figure 3, which compares the entire sequence lengths of the
20,422 human proteins to the coverage of sequences retrieved from the PDB. We can see that there
is a drastic shift to a smaller number of amino acids covered in experimentally determined
protein structures. Towards the aim of a complete human protein structure dataset, we will
need to combine experimental data available for different parts of the sequence and/or
integrate predictions of the missing parts. AlphaFold² provides a rich source of protein structure
predictions that could be used, but we can see from Figure 3 that the portions of sequences
missing in existing protein structures are also the ones that it has least confidence in.


4.2. Database Expansion Based on Multidomain Protein Interactions 


Fig. 3: (a) Distribution of protein length in UniProt in blue and the manually calculated 
coverage in green in the PDB. (b) AlphaFold’s prediction confidence for amino acid residues, 
with missing residues represented in blue and present residues in red, in the context of the 
highest-ranked structure from the Protein Data Bank (PDB) taken from our dataset. 


Ultimately, we wish to create a database of structures that covers the entire human proteome, 
and this will require inclusion of predictions. To illustrate the challenges and feasibility of 
expanding our dataset with AlphaFold predictions and/or by piecing together domains of a 
single UniProt ID for which domain structures have been determined independently in different 
experiments, we utilized the insulin receptor (IR) as a representative example. The IR is 
an important protein given its role in diabetes and the regulation of many cellular pathways, 
but it is also an experimentally challenging protein because it is a large, multimeric, 
multidomain, flexible membrane receptor. Thus, to date, a full-length structure covering the 
entire UniProt sequence P06213 does not exist despite many efforts. Details of the different PDB 
files providing structural information and coverage for extracellular insulin binding domains, 
i.e., transmembrane and cytoplasmic kinase domains, have been reviewed.19,20 6PXV provides 
the most extensive coverage,21 representing the cryo-EM structure of the IR in complex with 
four insulin molecules. Although the full-length sequence was subjected to experimental 
analysis, structural data was only obtained for the extracellular domain.21 Because the IR is a 
dimer, chains A and C in 6PXV are identical. Therefore, we focused our analysis solely on chain A. 
Initial steps involved utilizing PyMOL to visualize the distinctions between the experimen- 
tally derived structure of the IR and its predicted AlphaFold counterpart (AF-IR), depicted 
in Figure 4. Subsequently, we dissected both structures into their constituent domains: the 
leucine-rich repeat domains (L1-L2), a cysteine-rich region (CR), fibronectin type-III domains 
(FNIII-1-3), and the transmembrane domain (TM). Neighboring domains were sequentially 
input into DEMO2 (see Methods). For instance, the L1 and CR domains were initially 
submitted to DEMO2, followed by the CR and L2 domains. The output generated from both 
inputs was then imported into PyMOL, where the structures were aligned based on the 
“common” domain, in this case the CR domain. This procedure was repeated iteratively until 
all desired domains were incorporated into the aligned structure. We can see from Figure 4 
that DEMO2 not only reproduces the experimental cryo-EM structure as expected but also 
improves upon the initial AlphaFold prediction obtained when using the entire sequence: the 
integrated AlphaFold-IR structure portrayed in Figure 4 is noticeably improved compared to 
AlphaFold’s initial prediction. A significant portion of the error in both DEMO2-predicted 
structures in Figure 4 (integrated 6PXV and integrated AF-IR) can be attributed to an 
unconnected alpha helix from the FN3-2 domain. 


4.3. Database Expansion Based on Protein-Ligand Interactions 


Because our long-term goal is to view the human structure proteome as the targetome for 
small molecule ligands (and ultimately other biomolecules, but for now, we focus on small 
molecules), we used our protein structure datasets for docking more than 50 different ligands 
of different sizes and physicochemical properties. We used AutoDock Vina (see Methods) and 
encountered a number of errors for the structures in our dataset, enumerated in Table 1. 

[Figure 4 panels: PDB 6PXV chain A; AlphaFold-predicted IR chain A; integrated 6PXV using 
DEMO2; integrated AF-IR using DEMO2. Residue location/color key: FN3-2: 621-794; FB: 795-896; 
TM: 897-937; not present in 6PXV.] 

Fig. 4: Experimentally determined and predicted structures of IR. 


4.4. Automated Analysis of Ligand Prediction Results 


Even when looking at a single ligand, we now have thousands of AutoDock Vina prediction 
results. In the future, we plan to look at complex mixtures of ligands, which will result in 
even larger ligand docking datasets. Each AutoDock Vina result is a list of up to 9 docking 
poses for a given ligand-protein pair,4 which vary by the details of the pose of the ligand based 
on bond rotations and interactions with different parts of the protein, resulting in different 
predicted locations and/or affinities. We know from many examples that taking only the best 
affinity prediction may miss biologically meaningful ligand binding pockets, which could in 
fact represent allosteric and orthosteric pocket(s).22,23,24 Furthermore, bond rotations 
in the ligand can result in drastic changes in predicted affinity, while the overall location of 
the binding pocket remains similar. To capture these insights on a large scale, we propose two 
approaches to automated analysis of the AutoDock Vina prediction based on the volume and 
center of mass of the ligands, respectively. 


4.4.1. Ligand-volume based binding pocket location analysis 


The development of a method to analyze AutoDock Vina prediction results by ligand volume 
overlap is shown in Figure 5. Volumetric analysis of four different ligand-protein pairs is shown 
to exemplify different common scenarios observed in AutoDock Vina predictions. An example 
of a low percent overlap in the volumetric surface plot for ligand OA1 obtained from protein 
structure 3qtc, when docked to 1l9h (bovine rhodopsin, a G protein-coupled receptor), is shown 
in (a). We can see that the 9 predicted docking poses cluster into 5 easily distinguishable 
binding pockets. The opposite extreme is shown in (b), for ligand 00A obtained from PDB 
file 3cw8, docked to the same structure as in (a), 1l9h. All 9 docking poses are found in the 
same location, with 100 percent overlap. Other ligand-protein pairs show less clear results, 
for example, benzo(a)pyrene (BaP), a hydrophobic ring structure ligand (c), and apigenin, 
a flavonoid ligand also with hydrophobic ring structures but with several oxygen-containing 
groups (d), when docked to the same protein (1ksg). Both ligands are of comparable size but 
different physicochemical properties, and both show overlap that is not easily distinguishable 
with this approach. 

Fig. 5: Volumetric surface plots for different ligands or original protein:docked protein 
pairs: (a) OA1 3qtc:1l9h, (b) 00A 3cw8:1l9h, (c) Benzo(a)pyrene:1ksg, (d) apigenin:1ksg. 


We clustered the volumetric overlap results using an optimized KMeans clustering algo- 
rithm (see Methods). The result is shown for the interaction of BaP with 1ksg in Figure 6a,b. 
We can see that we now obtain clear separation into two clusters, representing two distinct 
pockets in well-separated domains of the 1ksg protein structure, shown in Figure 6c. 
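
The selection of the number of clusters by Silhouette Score can be illustrated with a minimal sketch. It is not the implementation described in the Methods: the plain Lloyd's k-means, the use of one 3D coordinate per docking pose as the clustering feature, the toy coordinates, and the names `kmeans`, `silhouette`, and `poses` are all assumptions made for illustration.

```python
import math
import random


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; -1.0 is returned if fewer than 2 clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton contributes a coefficient of 0
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical pose coordinates forming two well-separated pockets.
poses = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.0, 0.5, 0.0), (0.2, 0.2, 0.1),
         (10.0, 10.0, 10.0), (10.5, 10.0, 10.0), (10.0, 10.5, 10.0),
         (10.2, 10.1, 10.0)]
scores = {k: silhouette(poses, kmeans(poses, k)) for k in (2, 3, 4)}
best_k = max(scores, key=scores.get)  # k with the highest mean silhouette
```

On this toy input the two-cluster solution receives the highest score, which mirrors the situation reported for BaP docked to 1ksg.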


[Figure 6 plot residue: average silhouette score versus k; silhouette coefficient values; 
panels (a) and (b).] 


Fig. 6: Optimized Clustering Algorithm Deployed on BaP Ligand. (a) Number of Clusters = 
2. Optimized using Silhouette Score. (b) BaP Models Percent Overlap = 87.5%. (c) PyMOL 
representation of BaP, apigenin, and GTP in 1ksg structure. 


4.4.2. Ligand-center of mass based binding pocket location analysis 


A complementary approach to the volumetric overlap analysis is to reduce the complexity of 
ligand description to represent each pose by its center of mass. The result of this analysis 
for the same ligand:protein pairs BaP:1ksg and apigenin:1ksg is shown in Figure 7. We can 
see that even in the lowest resolution representation of the ligand, where the coordinates of 
each atom in the molecule were collectively replaced with a single coordinate for the center of 
mass, the separation between pockets is not entirely clear. Furthermore, we can see that the 
known ligand binding pocket for the ligand that is actually bound to 1ksg, GTP, is located in 
the pocket on the top, which carries an overall lower predicted affinity than the regions on 
the right-hand side of the protein. To see how the pockets observed with these three ligands 
compare to a larger set of 50 ligands, we clustered the results using DBSCAN. They formed 
eight distinct clusters, with clear preferences for 4 of these pockets. The DBSCAN analysis 
was run over the entire set of proteins to create a distribution of cluster counts. This 
distribution rises sharply to a peak at 7 clusters and has a slowly decreasing long tail to 
the right. 

Fig. 7: Centers of mass for apigenin ligand (a) and natural ligand BaP (b) docked to 1ksg. In 
(c), centers of mass for 50 ligands are clustered with DBSCAN. Structure as in Figure 6c. 
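
The center-of-mass reduction and the density-based clustering can be sketched as follows. This is only an illustration under stated assumptions: the text does not give the DBSCAN parameters, so `eps` and `min_pts` are hypothetical, the pose coordinates are invented, and the small `dbscan` function is a textbook version written here, not the implementation used for the reported results.

```python
import math


def center_of_mass(atoms):
    """atoms: list of (mass, (x, y, z)); returns the mass-weighted mean position."""
    total = sum(m for m, _ in atoms)
    return tuple(sum(m * xyz[i] for m, xyz in atoms) / total for i in range(3))


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN; returns one label per point, with -1 marking noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbors(j)
            if len(nb) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(nb)
    return labels


# Two hypothetical atoms: the heavier one pulls the center toward itself.
com = center_of_mass([(1.0, (0.0, 0.0, 0.0)), (3.0, (4.0, 0.0, 0.0))])

# Hypothetical centers of mass for seven docking poses.
poses = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
         (10.0, 0.0, 0.0), (11.0, 0.0, 0.0), (10.0, 1.0, 0.0),
         (50.0, 50.0, 50.0)]
labels = dbscan(poses, eps=2.0, min_pts=2)
```

Unlike k-means, DBSCAN does not need the number of pockets in advance and leaves isolated poses unassigned, which suits a setting where the cluster count per protein is itself the quantity of interest.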


5. Conclusions and Future Work 


In an era where assembling an “integrated structural map of the human cell”! at atomic 
resolution is no longer out of reach, cell structural bioinformatics will need to reconcile two 
extreme views of biomolecules inside cells: “selective” interaction of high-affinity ligands with 
single protein targets versus “everything binds to everything” the deciphering of which requires 
quantification of ligand and protein concentrations to determine chemical equilibria of binding. 
Our long-term goal is to assist this task and ongoing cell structural bioinformatics efforts by 
developing a human protein structure targetome database and a pipeline of automated tools 
that allow quantitative analysis of millions of protein-ligand interactions. Towards this goal, 
we present the docking of our current version of the human protein targetome to ligands using 
AutoDock Vina. We developed two complementary, automated analyses of affinity and binding 
site location: center of mass comparisons, which can identify clusters at a coarse-grained 
level but ignore the size and shape of the ligands, and Silhouette Score clustering 
optimization of predicted ligand volume overlap, suitable for detailed analysis of ligand overlap 
when this level of detail is needed. In the future, we plan to use the human targetome and its 
ligand binding information to make predictions on the competition of ligands with different 
affinities to gain insights into challenging problems such as regulation of metabolic pathways, 
interactions with complex mixtures of nutrients and pollutants, and predicting off-target effects 
of drugs. With millions of known small molecules from natural sources and large numbers of 
ligands that can be synthesized in the laboratory, this pipeline will complement projects where 
experiments alone cannot reach the scale needed to gain biological insights. 

Each iteration of the set of the structures comes with limitations. Our current dataset 
has the major limitation that it only represents a fraction (7606 of 20422 = 37%) of all hu- 
man proteins. Currently, all structures are experimentally determined, while future iterations 
will also include predictions. To illustrate how predictions can be incorporated, we used an 
example, the insulin receptor, with sequential assembly of domains from N to C terminus. 
These strategies can be improved; for example, a sensitivity analysis of the order in 
which domains are assembled can be carried out. Other structure prediction and assembly 
strategies can be used that are specialized for the type of protein or domain or structural 
element, such as transmembrane helices. Users of the current and future protein structure 
datasets can further filter them if more uniform data are required or if the focus is on a given 
location, such as extracellular or a given subcellular compartment. Other limitations include 
the differences in quality of different structures, the lack of water molecules, ions and other 
solvents such as lipids, all known to be important contributors to ligand binding. This dataset 
can be subjected to future improvements in methods or filters as needed for a given use case. 

The focus (and implementation status) of the current paper is the development of the 
curated database and tools for its analysis if it is used in target identification using tools such as 
AutoDock Vina.4 The need to choose a method for docking of ligands presents another inherent 
limitation in this work. AutoDock Vina,4 for example, is very widely used and compares well 
with other methods,” but reverse docking in general suffers from large false positive rates 
due to limitations in scoring functions.?° However, in most cases, a proper gold standard for 
target discovery is absent as it is typically unknown which proteins are true negatives (i.e., 
are not targets). The explosion in new computational methods using machine learning and 
artificial intelligence® can be used to replace or complement the reverse docking approach 
using AutoDock Vina or related methods, for example with state-of-the-art deep learning tools 
for ligand binding pocket predictions. The goal of the curated protein structure database 
described here was to improve coverage of the human structural proteome, while keeping the 
quality of the dataset as high as possible with state-of-the-art in data and tool availability to 
enable applications in cell structural bioinformatics. 


Recently, drug repurposing has emerged as an effective and resource-efficient paradigm for 
Alzheimer’s disease (AD) drug discovery. 
drug discovery. Among various methods for drug repurposing, network-based methods have 
shown promising results as they are capable of leveraging complex networks that integrate 
multiple interaction types, such as protein-protein interactions, to more effectively identify 
candidate drugs. However, existing approaches typically assume paths of the same length 
in the network have equal importance in identifying the therapeutic effect of drugs. Other 
domains have found that same length paths do not necessarily have the same importance. 
Thus, relying on this assumption may be deleterious to drug repurposing attempts. In 
this work, we propose MPI (Modeling Path Importance), a novel network-based method 
for AD drug repurposing. MPI is unique in that it prioritizes important paths via learned 
node embeddings, which can effectively capture a network’s rich structural information. 
Thus, leveraging learned embeddings allows MPI to effectively differentiate the importance 
among paths. We evaluate MPI against a commonly used baseline method that identifies 
anti-AD drug candidates primarily based on the shortest paths between drugs and AD in the 
network. We observe that among the top-50 ranked drugs, MPI prioritizes 20.0% more drugs 
with anti-AD evidence compared to the baseline. Finally, Cox proportional-hazard models 
produced from insurance claims data aid us in identifying the use of etodolac, nicotine, and 
BBB-crossing ACE-INHs as having a reduced risk of AD, suggesting such drugs may be viable 
candidates for repurposing and should be explored further in future studies. 


Keywords: Alzheimer’s Disease; Drug Repurposing; Machine Learning. 


1. Introduction 


Alzheimer’s Disease, denoted AD, is a progressive neurodegenerative disorder that accounts 
for 60%-70% of dementia cases and affects more than 50 million people worldwide today.'? 
Given the large number of affected individuals and AD’s life-threatening nature,®? extensive 
resources have been dedicated to developing AD-modifying drugs. Since 2003, inefficacy or 
toxicity has accounted for a 95+% failure rate among candidates evaluated for AD treatment.*° 

© 2023 The Authors. Open Access chapter published by World Scientific Publishing Company and 
distributed under the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC) 
4.0 License. 

Furthermore, none of the current US Food and Drug Administration (FDA)-approved AD drugs 
are curative; they only slow disease progression. Because of the immense resources required to 
conduct clinical trials,° the numerous failed clinical trials have necessitated the development of 
a more resource-efficient method for AD drug discovery. In the last decade, the identification of 
new therapeutic indications for existing FDA-approved drugs, referred to as drug repurposing,’ 
has emerged as an effective and resource-efficient paradigm for drug discovery. This is an 
attractive option as the toxicity, pharmacokinetics, and pharmacodynamics of FDA-approved 
drugs have already been thoroughly investigated by previous clinical trials.” 

Recently, the curation of comprehensive drug databases has enabled the development of 
computational methods for AD drug repurposing.'? ‘4 Among all the methods, network-based 
methods have shown promising results and emerged as a popular approach.13-15-16 Network- 
based methods utilize comprehensive protein-protein, drug-target, and AD-protein interactions 
to effectively reveal potential therapeutic effects of drugs on AD. Though promising, existing 
methods!? measure the therapeutic effects of drugs on AD primarily using count and length of 
the paths connecting drug nodes and the AD node in the network. Paths of the same length 
are considered equivalently effective at identifying the therapeutic effect of drugs by these 
methods. However, in other domains, paths of the same length have been shown to exhibit 
substantially different levels of importance.!”!® As such, assuming equal length paths have 
equal importance could be detrimental to effective drug repurposing for AD. 

In this work, we propose a novel method to conduct drug repurposing for AD, MPI (Modeling 
Path Importance), to address this limitation. Similar to existing methods,'*!4 MPI leverages 
the interactions between drugs and AD via proteins as indications of the potential therapeutic 
effects of drugs on AD. Based on the interactions, MPI introduces a scoring function to score and 
rank drugs for their anti-AD effectiveness. MPI is unique in that it learns node embeddings?’ 
and prioritizes important paths via these learned embeddings. Recent work? has shown that 
the learned node embeddings can effectively capture the rich structure information within 
a network. Thus, scoring paths using node embeddings allows MPI to utilize the network 
structure information to better prioritize paths for effective AD drug repurposing. Specifically, 
in this study, MPI leverages DeepWalk,”! a widely used network learning approach, to generate 
node embeddings. Edges are scored using a normalized dot product between the learned 
node embeddings; paths and drugs are scored by multiplying individual edge scores. Note 
that because MPI serves as a general framework, other network learning approaches, such as 
Node2Vec”? and graph neural networks,?? could also be easily incorporated to generate node 
embeddings. 

In this study, we construct a network to conduct drug repurposing for AD by combining 
protein-protein interactions (PPIs), drug-target interactions (DTIs), and AD-protein interac- 
tions (APIs) from multiple data sources. To investigate the effectiveness of MPI, we compare 
MPI against a commonly utilized network-based drug repurposing method for AD,!*:?? denoted 
as BSL, using our network. Our experimental results demonstrate that among the top-50 ranked 
drugs, MPI prioritizes 20% more drugs with anti-AD evidence compared to BSL. We examine 
published literature and analyze insurance claims metadata to evaluate the evidence of anti-AD 
activity among MPI’s top prioritized candidates. Our evaluation finds consensus 
between published experimental results and our own analysis for a few drug candidates. 
Notably, angiotensin converting enzyme inhibitors (ACE-INHs) represent a class of drugs that 
should be further explored for their anti-AD properties. Moreover, other drugs, such as 
nicotine, that enhance the brain’s response to acetylcholine and reduce cholinergic atrophy should 
be examined as well. Conversely, we find that, relative to other evaluated drugs, long-term use 
of trihexyphenidyl increases the risk of AD. This was corroborated by previously published in 
vivo experiments.” Finally, we find etodolac to confer the lowest risk of developing AD among 
all cyclooxygenase inhibitors (COX-INHs) in our network. Altogether, these findings suggest 
that MPI may be a viable option with respect to identifying repurposing candidates to treat 
AD. 


2. Materials and Methods 
2.1. Network construction 


PPIs, DTIs and APIs have shown utility for AD drug repurposing.!? As such, we construct our 
network using these interactions. Below, we describe our process for compiling the PPIs, DTIs 
and APIs used to construct our network from public data sources. In total, our network has 
327,924, 2,854, and 230 edges corresponding to PPIs, DTIs, and APIs, respectively. These edges connect 
one AD node, 18,527 protein nodes, and 386 drug nodes. 


2.1.1. Protein-protein interactions (PPIs) 


Following Chen et al.,!° we include a comprehensive list of human PPIs consisting of 327,924 
interactions. This list aggregates a total of 21 bioinformatics and systems biology databases 
with combinations of five types of experimental evidence. We refer interested readers to 
Chen et al.!° for a detailed description of the databases. 


2.1.2. Drug-target interactions (DTIs) 


We assemble drug-target interactions and bioactivity data from 4 commonly used databases 
(each downloaded in November 2022): the ChEMBL database” (v31), the binding database,”° 
the therapeutic target database,?’ and the IUPHAR/BPS guide to pharmacology database.?® 
We retain the drug-target interactions that satisfy the following inclusion criteria: 1) binding 
affinities, including Ki, Kd, IC50, or EC50, must be less than or equal to 10 µM; 2) protein 
targets and their respective proteins must have a unique UniProt?’ accession number; 3) 
protein targets must be marked as reviewed in the UniProt database; 4) protein targets must 
be present in Homo sapiens. 
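
The four inclusion criteria can be expressed as a simple record filter. The sketch below is a hedged illustration: the `Interaction` record layout, the field names, and the example drugs and accession numbers are hypothetical, and the uniqueness requirement on the UniProt accession is approximated by requiring a non-empty accession.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    drug: str
    uniprot_id: str     # accession of the protein target (hypothetical values below)
    affinity_um: float  # Ki, Kd, IC50 or EC50, expressed in micromolar
    reviewed: bool      # marked as reviewed in UniProt
    organism: str


def passes_criteria(rec: Interaction) -> bool:
    """Apply criteria 1-4: affinity, accession, review status, organism."""
    return (rec.affinity_um <= 10.0          # 1) at most 10 uM
            and bool(rec.uniprot_id)         # 2) has an accession (approximation)
            and rec.reviewed                 # 3) reviewed entry
            and rec.organism == "Homo sapiens")  # 4) human protein


# Invented example records, one per outcome.
records = [
    Interaction("drug_a", "P00001", 0.5, True, "Homo sapiens"),   # kept
    Interaction("drug_b", "P00002", 25.0, True, "Homo sapiens"),  # affinity too weak
    Interaction("drug_c", "P00003", 1.0, False, "Homo sapiens"),  # not reviewed
    Interaction("drug_d", "P00004", 10.0, True, "Mus musculus"),  # not human
    Interaction("drug_e", "P00005", 10.0, True, "Homo sapiens"),  # kept (boundary)
]
kept = [r.drug for r in records if passes_criteria(r)]
```

Note that the threshold is inclusive, so an interaction measured at exactly 10 µM is retained.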

Additionally, we retain drugs for which we have sufficient sample size to conduct quantitative 
analysis using MarketScan*® insurance claims metadata (see Section 2.4). Specifically, 
included drugs have at least 100 patients with their first dose at least 2 years prior to an 
AD diagnosis (dx). Additionally, these drugs must have at least 15 patients who eventually 
received an AD dx. Applying these filters yielded 2,854 edges connecting 386 FDA-approved 
drugs to 548 protein targets. 


2.1.3. AD-protein interactions (APIs) 


The AD-associated proteins included in the network were identified from multiple sources. 
54 β-amyloid-related proteins and 27 tauopathy-related proteins were obtained from Cheng 
et al.'° The authors identified proteins that satisfied at least one of the following criteria: 
1) the proteins are validated in large-scale amyloid or tauopathy genome-wide association 
studies; 2) in vivo experimental models exhibit evidence that knockdown or overexpression of 
the protein leads to AD-like amyloid or tau pathology. We also include 93 unique late-onset AD 
common risk proteins identified by 7 large-scale genetic studies.31-37 We further incorporate 
a set of 118 AD-associated proteins introduced in at least 2 out of the 6 following databases 
(each was downloaded in November 2022): the online Mendelian inheritance in man database,*® 
the comparative toxicogenomics database,®? the HuGE navigator database, the DisGeNET 
database“! (v7.0), the ClinVar database? and the Open Targets database* (v22.09). In total, 
our network comprises 230 unique AD-associated proteins. Each of the AD-associated 
proteins is connected to a single AD node, with each edge between a protein and the AD node 
representing an API in our network. 


2.2. Modeling path importance for AD drug repurposing 


In this work, we denote the constructed network as $G$. Each node in $G$ is denoted as $v_i$. 
Specifically, drug nodes, protein nodes and the AD node are $v_i^d$, $v_i^g$, and $v^a$, 
respectively. Note that the index $i$ does not apply to the AD node, as there is only one in our 
network. Each edge that connects node $v_i$ to node $v_j$ is denoted as $e_{ij}$. Each path is 
denoted as $p_m$, and the set of edges involved in a path is denoted $E_{p_m}$. Below, we denote 
matrices, scalars and row vectors using uppercase, lowercase, and bold lowercase letters, 
respectively. 

In MPI, we leveraged DeepWalk,”! a widely used node embedding approach, to learn embed- 
dings for each node in G. First, for each node v; in the network, we conduct 256 random walks 
originating from this node, and terminating once the path length reaches 128. DeepWalk is then 
trained by sliding a window of length 10 over the generated paths. Nodes within the same 
window are forced to have similar embeddings following the objective function defined in the 
original paper.?! Node embeddings for MPI are produced such that they have 128 dimensions. 
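
The walk generation and windowing described above can be sketched in a few lines. This is an illustration under assumptions, not the DeepWalk implementation used in MPI: walk length is counted in nodes, the window is taken to cover up to 10 positions on either side of the center node, and the toy adjacency list and function names are invented. Training the embeddings from the resulting pairs is omitted.

```python
import random


def random_walks(adj, walks_per_node=256, walk_length=128, seed=0):
    """Uniform random walks from every node; stops early only at a dead end."""
    rng = random.Random(seed)
    walks = []
    for start in adj:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                nbrs = adj[walk[-1]]
                if not nbrs:
                    break
                walk.append(rng.choice(nbrs))
            walks.append(walk)
    return walks


def context_pairs(walk, window=10):
    """(center, context) pairs for nodes at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(walk):
        lo, hi = max(0, i - window), min(len(walk), i + window + 1)
        pairs.extend((center, walk[j]) for j in range(lo, hi) if j != i)
    return pairs


# Hypothetical toy network: one AD node, two proteins, one drug.
adj = {"AD": ["P1"], "P1": ["AD", "P2", "D1"], "P2": ["P1"], "D1": ["P1"]}
walks = random_walks(adj, walks_per_node=2, walk_length=5)
pairs = context_pairs(["a", "b", "c"], window=1)
```

With the settings stated in the text (256 walks of length 128 per node), a network of roughly 19,000 nodes yields several million walks, so a practical implementation would stream walks to the trainer instead of holding them all in a list.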

After generating node embeddings, we score each edge $e_{ij}$ using a normalized dot product of 
the embeddings of $v_i^x$ ($x = d$, $g$ or $a$) and $v_j^y$ ($y = d$, $g$ or $a$) as follows: 

$$w_{ij} = \frac{\exp(\mathbf{v}_i^x \mathbf{v}_j^{y\top})}{\sum_{v_k \in V} \exp(\mathbf{v}_i^x \mathbf{v}_k^{\top})}, \tag{1}$$

where $w_{ij}$ is the score of the edge $e_{ij}$; $\mathbf{v}_i^x$ and $\mathbf{v}_j^y$ are the 
learned embeddings of nodes $v_i^x$ and $v_j^y$, respectively; $\exp(\cdot)$ is the exponential 
function; and $V$ is the set of all the nodes in the network. Note that, in Equation 1, only one 
of $v_i^x$ and $v_j^y$ could be the AD node. These edge 
scores are calculated with node embeddings, which implicitly capture the rich structural 
information within the network. Thus, compared to existing methods, MPI can better leverage 
a network’s structural information for AD drug repurposing. We calculate the score for each 
path by multiplying the scores of its individual edges as follows: 

$$s_{p_m} = \prod_{e_{ij} \in E_{p_m}} w_{ij}, \tag{2}$$

where $s_{p_m}$ is the score of the path $p_m$, and $E_{p_m}$ is the set of all the edges in the 
path $p_m$. The score for each drug (i.e., $s_{v_i^d}$) is then defined as the summation of the 
scores from all 3-hop or shorter paths that originate from the AD node and terminate at the 
drug node. 
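
The edge, path, and drug scoring can be sketched on a toy network. The code below is an illustration under assumptions: the edge score is read as a softmax over all nodes, paths are taken to be simple (no repeated nodes) and are followed in the direction from the AD node to the drug node, and the embeddings, node names, and function names are invented.

```python
import math


def edge_score(emb, i, j):
    """Softmax-normalized dot product of the embeddings of nodes i and j."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    denom = sum(math.exp(dot(emb[i], emb[k])) for k in emb)
    return math.exp(dot(emb[i], emb[j])) / denom


def drug_score(adj, emb, source, drug, max_hops=3):
    """Sum of path scores over simple paths with at most max_hops edges."""
    total = 0.0
    stack = [(source, 1.0, (source,))]
    while stack:
        node, score, path = stack.pop()
        if node == drug:
            total += score  # a complete AD-to-drug path
            continue
        if len(path) - 1 == max_hops:
            continue  # hop budget exhausted
        for nxt in adj[node]:
            if nxt not in path:
                stack.append((nxt, score * edge_score(emb, node, nxt),
                              path + (nxt,)))
    return total


# Hypothetical 2-dimensional embeddings and a toy drug-protein-AD network.
emb = {"AD": (0.2, 0.1), "P1": (0.4, -0.3), "P2": (-0.1, 0.5),
       "D1": (0.3, 0.3), "D2": (-0.2, 0.1)}
adj = {"AD": ["P1", "P2"], "P1": ["AD", "P2", "D1"],
       "P2": ["AD", "P1", "D2"], "D1": ["P1"], "D2": ["P2"]}
score_d1 = drug_score(adj, emb, "AD", "D1")
```

In this toy network, D1 is reached by one 2-hop path (AD, P1, D1) and one 3-hop path (AD, P2, P1, D1). Because every edge score is below 1, longer paths contribute less, but two paths of the same length can still differ in score, which is the point of the method.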


2.3. Baseline method 


To evaluate the performance of MPI, we compare MPI against a network-based method recently 
developed by Cheng et al.,'* denoted as BSL. BSL scores drugs based on the shortest distance 
between the drug targets and the AD-associated proteins (Section 2.1.3) in the network. 
Specifically, we denote $T(i)$ as the set of protein targets associated with a given drug 
$v_i^d$, and denote $P$ as the set of AD-associated proteins. The proximity between these two 
sets is calculated as the average shortest distance between elements in $T(i)$ and $P$ as follows: 

$$r(T(i), P) = \frac{1}{|T(i)| + |P|} \left( \sum_{v_j \in T(i)} \min_{v_k \in P} d(v_j, v_k) + \sum_{v_k \in P} \min_{v_j \in T(i)} d(v_k, v_j) \right), \tag{3}$$

where $r(T(i), P)$ is the proximity between these two sets; $|T(i)|$ and $|P|$ are the sizes of 
$T(i)$ and $P$, respectively; and $\min_{v_k \in P} d(v_j, v_k)$ is the shortest distance between 
$v_j$ and any element in $P$. Subsequently, we conduct a permutation test to assess the 
statistical significance of the calculated proximity. The resulting z-score from this test is 
used as the score of drug $v_i^d$.13 In 
BSL, a lower drug score implies a higher potential for effective AD treatment. 
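
The proximity measure can be made concrete with breadth-first search on an unweighted network. This is a minimal sketch under assumptions: the toy chain network, the node names, and the function names are invented, all nodes are assumed to be mutually reachable, and the permutation test that converts the proximity into a z-score is not shown.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def proximity(adj, targets, ad_proteins):
    """Average closest distance between drug targets and AD-associated proteins."""
    # One BFS per node of interest; assumes targets and AD proteins are connected.
    d = {v: bfs_distances(adj, v) for v in set(targets) | set(ad_proteins)}
    to_ad = sum(min(d[t][p] for p in ad_proteins) for t in targets)
    to_targets = sum(min(d[p][t] for t in targets) for p in ad_proteins)
    return (to_ad + to_targets) / (len(targets) + len(ad_proteins))


# Hypothetical chain A - B - C - D with one drug target and two AD proteins.
adj = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
r = proximity(adj, targets=["A"], ad_proteins=["C", "D"])
```

Here the target A is 2 hops from its nearest AD protein, while C and D are 2 and 3 hops from A, giving (2 + 2 + 3) / 3. The measure depends only on path length, which is the property MPI is designed to relax.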


2.4. Validation using MarketScan database 


We use the MarketScan Medicare Supplemental database from 2012-2021 to evaluate drug impact 
on AD onset via Cox proportional-hazard models.2° The MarketScan database includes data 
for over 8 million unique individuals and is comprised of demographic information, adminis- 
trative information, diagnoses, procedures, and pharmacy records. International Classification 
of Disease (ICD)-9/ICD-10 codes denote diagnoses and National Drug Codes (NDCs) record 
pharmacy claims. We use the ICD-9/ICD-10 codes listed in Supplementary Table S4® to de- 
fine AD and comorbidities, which are included as covariates in Cox proportional-hazard models. 
We conduct our analysis over 1,632,218 unique individuals who were at least 65 years old by 2022 
and possessed a minimum of five years of insurance enrollment prior to first AD diagnosis. Drugs 
from our constructed network are mapped to NDC codes by partial matching of generic names 
from the MarketScan Red Book. We only include patients who took or started taking a drug at 
least two years prior to AD diagnosis to mitigate the possibility that patients starting a drug 
already had AD, given that AD is difficult to diagnose. 

Fig. 1: Figure 1a shows the network construction process in MPI. Figure 1b shows the DeepWalk- 
based node embedding generation in MPI. Figure 1c shows the edge, path and drug scoring in 
MPI. 


3. Results 
3.1. MPI for AD drug repurposing 


In this study, we curate a network consisting of PPIs, DTIs and APIs and propose a novel 
network-based method, MPI, for AD drug repurposing. We propose MPI with the following 
intuitions: 1) proteins that are associated with AD are localized in the corresponding disease module 
within the comprehensive human PPI network; 2) the drug target(s) for a disease may also 
be targeted for other diseases (e.g., AD) owing to common functional targets and pathways 
elucidated by PPIs; 3) if a drug node is linked to the AD node through the paths of drug 
targets and AD-associated proteins in the PPI, the drug may have a treatment effect on AD. 
We implement MPI using the following steps: 1) integrate AD-protein interactions, drug-target 
interactions and protein-protein interactions to generate a comprehensive network 
(Figure 1a), 2) employ DeepWalk to learn node embeddings which capture the structural 
information within the network (Figure 1b), and 3) score edges, paths and drugs based on the 
learned embeddings to leverage the structural information for better AD drug repurposing 
(Figure 1c). Then we identify plausible treatment candidates from the top-ranked drugs using a 
literature search of the published evidence. We collected 327,924 PPIs from 21 bioinformatics 
and systems biology databases (Section 2.1.1). We also collected 2,854 DTIs from 4 commonly 
used databases (Section 2.1.2), and 230 comprehensive APIs from multiple resources 
(Section 2.1.3). By aggregating all the interactions, we construct a drug-protein-AD network 
comprised of 386 drug nodes, 18,527 protein nodes, 1 AD node, and 331,008 edges. More details 
about the network construction are available in Section 2. To the best of our knowledge, MPI is 
the first method which effectively repurposes drug candidates for AD treatment by prioritizing 
paths between drug nodes and the AD node using learned node embeddings. 

Supplementary material and code can be found here: https://github.com/ninglab/MPI 


3.2. Comparing anti-AD evidence of MPI’s and BSL’s top-50 drugs 


We compare the top-50 drugs prioritized by MPI and BSL to evaluate their capacity for
repurposing drugs to treat AD. Specifically, we score and rank all 386 drug nodes in our
network using MPI and BSL. The complete rankings are reported in Supplementary Table S3. We
then perform a literature search to evaluate the anti-AD evidence of the top-50 ranked drugs
for both MPI and BSL. We define anti-AD evidence as any published experimental result(s)
which demonstrate that a drug either protects against the development of AD or ameliorates
aberrant cellular phenotypes caused by AD. We present MPI’s and BSL’s top-10 drugs and their
anti-AD evidence in Table 1 and Table 2, respectively. The complete rankings for the top-50
drugs and their anti-AD evidence are available in Supplementary Tables S1 and S2. Based on the
significance of the anti-AD evidence, we categorize drugs into the following 6 types in
decreasing order of significance: 1) drugs which are FDA-approved for AD treatment
(approved); 2) drugs that have demonstrated anti-AD effects in completed clinical trials or
are under investigation in AD clinical trials (clinical); 3) drugs which have demonstrated
anti-AD effects in in vivo experiments (in vivo); 4) drugs which have demonstrated anti-AD
effects in in vitro experiments (in vitro); 5) drugs which show anti-AD effects in
observational studies, cohort studies or analyses in insurance data (other);
6) drugs that either do not have the above 5 types of evidence or have been demonstrated
ineffective or damaging for AD (NA). We present the distribution of the top-50 drugs from
MPI and BSL over the different types of evidence in Figure 2a and the counts of each evidence
type in Figure 2b. In Figure 2a, we observe more drugs with evidence ranked highly by MPI
compared to BSL. This is supported by Figure 2b, which confirms that MPI identified more
evidential anti-AD drugs compared to BSL in the top-50 ranked drugs. Specifically, among the
top-50 ranked drugs, MPI prioritized 24 evidential anti-AD drugs while BSL only prioritized 20,
an improvement of 20%. Figures 2a and 2b also show that MPI outperforms BSL in prioritizing
drugs with significant evidence. MPI prioritizes all 4 FDA-approved anti-AD drugs in our
network (galantamine, rivastigmine, donepezil, and memantine) among the top-50. In contrast,
BSL prioritizes only a single FDA-approved anti-AD drug (donepezil) among the top-50.

Fig. 2: Evaluation of drug rank distributions: MPI and BSL.
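
As a quick arithmetic check on the figures quoted above, the following minimal Python sketch counts evidential drugs in a ranked list and computes the relative improvement of one method over another. The function names are illustrative only and are not part of MPI; the top-10 flags are copied from the Anti-AD column of Table 1, and the top-50 counts (24 and 20) are taken from the text.

```python
def count_evidential(evidence_flags):
    """Count drugs flagged 'Y' (published anti-AD evidence exists)."""
    return sum(1 for flag in evidence_flags if flag == "Y")


def relative_improvement(new_count, baseline_count):
    """Fractional improvement of new_count over baseline_count."""
    if baseline_count <= 0:
        raise ValueError("baseline_count must be positive")
    return (new_count - baseline_count) / baseline_count


# Anti-AD column of Table 1 (MPI's top-10), in rank order.
mpi_top10_flags = ["N", "Y", "Y", "N", "Y", "Y", "Y", "N", "Y", "N"]
mpi_top10_evidential = count_evidential(mpi_top10_flags)

# Top-50 evidential counts reported in the text: 24 (MPI) versus 20 (BSL).
top50_improvement = relative_improvement(24, 20)

print(mpi_top10_evidential)        # 6
print(f"{top50_improvement:.0%}")  # 20%
```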

We also observe in Table 1 and Table 2 that MPI is more effective than BSL at prioritizing
anti-AD drugs among the very top (top-10) of the ranking list. That is, among the top-10 drugs,
6 drugs from MPI have anti-AD evidence, including the FDA-approved AD drug galantamine,
while only 4 drugs from BSL are evidential. As presented in Section 2, compared to BSL, MPI
learns node embeddings to capture the rich structural information within the network, and
leverages the structural information to better identify anti-AD drugs. The superior performance
of MPI over BSL demonstrates the effectiveness of leveraging the network structural information
to identify repurposing candidates for AD treatment. We also notice that MPI and BSL have
17 drugs in common within their top-50 drug lists. Among the 17 drugs,
5 drugs demonstrate anti-AD evidence: donepezil is an FDA-approved anti-AD drug; nicotine
and rasagiline have clinical anti-AD evidence; and fluvoxamine and fluoxetine have in vivo
anti-AD evidence. The drugs nicotine, rasagiline, fluvoxamine, and fluoxetine could be promising
repurposing candidates. We leave the investigation of these drugs to future research. 


3.3. Identifying repurposing candidates with anti-AD activity 


In order to identify plausible candidates for repurposing, we produce Cox proportional-hazard 
models (see Section to ascertain whether there is consensus between the MarketScan 
insurance data and the AD-related evidence we found for top ranked candidates prioritized by 
MPI. Specifically, we use hazard ratios (HR) to identify whether any evidential drug is
associated with a reduced risk of AD diagnosis among patients who took the drug compared
against those that did not. We present each drug’s HR with its significance level in
Supplementary Table S5b; the HRs for each drug’s covariates (sex, age, and additional common
comorbidities) are reported in Supplementary Table S5c-t. An HR below 1 indicates that a drug
has a protective effect, while an HR above 1 indicates that a drug has a damaging effect.
Figure 3a plots Kaplan-Meier (KM) survival curves. These plots depict a patient’s likelihood
of being diagnosed with AD following long-term use of either an individual prescribed drug or
a drug with a given mechanism of action (MOA). For MOAs, we group highly-prioritized drugs
with published 
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Table 1: Top-10 Drugs from MPI 


Drug            MOA                            Indication              Anti-AD  Evidence
varenicline     AChR-Ag                        smoking cessation       N        -
fosinopril      ACE-INH                        hypertension            Y        in vivo⁴⁴
nicotine        AChR-Ag                        smoking cessation       Y        clinical⁴⁵
nizatidine      histamine receptor antagonist  duodenal ulcer disease  N        -
piroxicam       COX-INH                        osteoarthritis          Y        other⁴⁶,⁴⁷
meloxicam       COX-INH                        osteoarthritis          Y        in vivo⁵⁰
galantamine     AChE-INH                       Alzheimer’s disease     Y        approved
bromfenac       COX-INH                        inflammation            N        -
etodolac        COX-INH                        osteoarthritis          Y        in vivo⁵¹
pyridostigmine  AChE-INH                       myasthenia gravis       N        -


In this table, the column “Drug” shows the identified top-10 ranked drugs; the column “MOA” shows 
the mechanism of action of each drug; the column “Indication” presents the indication of each drug; 
the column “Anti-AD” indicates if the drug has evidenced anti-AD effects; and the column “Evidence” 
presents the type of the evidence. In this table, ACE-INH represents the angiotensin converting enzyme 
inhibitor; COX-INH represents the cyclooxygenase inhibitor; AChE-INH represents the acetylcholinesterase 
inhibitor; and AChR-Ag represents the acetylcholine receptor agonist. 


Table 2: Top-10 Drugs from BSL 


Drug          MOA                                            Indication                    Anti-AD  Evidence
[illegible]   bacterial 30S ribosomal subunit inhibitor      respiratory tract infections  Y        [illegible]
selegiline    monoamine oxidase inhibitor                    Parkinson’s Disease           N        -
ceftriaxone   bacterial cell wall synthesis inhibitor        gonorrhea                     Y        in vivo
ibuprofen     COX-INH                                        headache                      N        -
levobunolol   adrenergic receptor antagonist                 glaucoma                      N        -
ketoprofen    COX-INH                                        rheumatoid arthritis          N        -
carbidopa     aromatic L-amino acid decarboxylase inhibitor  Parkinson’s Disease           N        -
sulindac      COX-INH                                        osteoarthritis                Y        in vivo>*
biotin        vitamin B                                      supplement                    Y        in vivo”
lansoprazole  ATPase inhibitor                               heartburn                     N        -


In this table, the column “Drug” shows the identified top-10 ranked drugs; the column “MOA” 
shows the mechanism of action of each drug; the column “Indication” presents the indication of 
each drug; the column “Anti-AD” indicates if the drug has evidenced anti-AD effects; and the 
column “Evidence” presents the type of the evidence. In this table, ACE-INH represents the an- 
giotensin converting enzyme inhibitor; COX-INH represents the cyclooxygenase inhibitor; AChE-INH 
represents the acetylcholinesterase inhibitor; and AChR-Ag represents the acetylcholine receptor 
agonist. 


evidence of anti-AD activity (see Table 1). Bupropion (HR = 1.04; non-significant) was included
as a negative control as clinical trials found the drug had no significant effect on cognition in
AD patients.⁵⁶ Trihexyphenidyl (HR = 1.71; α < 0.001) was included as a positive control for
damaging effects due to the evidence documented in Supplementary Table S1. The COX-INHs
group includes the following drugs: piroxicam, meloxicam, etodolac, and flurbiprofen. The
ACE-INHs group includes the following drugs: fosinopril, trandolapril, and lisinopril. Note that
we only include blood brain barrier (BBB) crossing ACE-INHs in this group as non-BBB-crossing
ACE-INHs have exhibited very limited effects on AD.⁵⁷ We also include time-to-event analysis for
4 of BSL’s top prioritized drugs (see Supplementary Figure S1). Unlike MPI, we observe only
one of BSL’s drugs (sulindac) with reduced time-to-event compared to bupropion; however,
this difference is not significant. 
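
To make the Kaplan-Meier curves discussed above concrete, here is a minimal sketch of the product-limit estimator in plain Python. It is not the authors' analysis code: the function name and the toy follow-up data are assumptions made for illustration, and the sketch computes neither confidence intervals nor the covariate-adjusted hazard ratios that a Cox model provides.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Product-limit estimate of the survival function.

    durations: follow-up time for each patient (e.g., in days).
    events:    1 if the event (e.g., AD diagnosis) was observed, 0 if censored.
    Returns a list of (time, survival_probability) at each distinct event time.
    """
    if len(durations) != len(events):
        raise ValueError("durations and events must have the same length")

    event_counts = Counter(t for t, e in zip(durations, events) if e)
    exit_counts = Counter(durations)  # events and censorings both leave the risk set

    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exit_counts):
        d = event_counts.get(t, 0)
        if d:
            # Multiply by the conditional probability of surviving past time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exit_counts[t]
    return curve


# Hypothetical toy cohort: five patients, two of them censored.
toy_durations = [1, 2, 2, 3, 4]
toy_events = [1, 1, 0, 1, 0]
curve = kaplan_meier(toy_durations, toy_events)
for time, prob in curve:
    print(time, round(prob, 3))
```

A curve that stays closer to 1.0 over time corresponds to a lower estimated probability of the event in that group, which is how the panels in Fig. 3 are read.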


[Figure 3: three Kaplan-Meier panels; x-axis: Time (days), ticks at 1000, 2000, and 3000.
(a) Drugs and MOAs with published anti-AD evidence: ACE-INHs, bupropion, COX-INHs, hydralazine,
nicotine, pentoxifylline, trihexyphenidyl.
(b) COX-INHs: etodolac, flurbiprofen, meloxicam, piroxicam. Shaded regions represent 95%
confidence intervals.
(c) ACE-INHs: fosinopril, lisinopril, trandolapril. Shaded regions represent 95% confidence
intervals.]


Fig. 3: Unadjusted Kaplan-Meier plots for Cox proportional-hazard models 


3.4. Analyzing the MOAs of MPI’s top-50 drugs 


To identify groups of drugs whose anti-AD properties should be further examined and explored,
we examine the top-50 drugs prioritized by MPI for any common MOAs. We find that COX-INHs
and ACE-INHs are the most common MOAs prioritized by MPI. Both COX-INHs and ACE-INHs
have published evidence of anti-AD activity. That said, experimental results suggest that
long-term administration of COX-INHs may only have protective properties, reducing the risk of AD
onset. Moreover, meloxicam (HR = 0.86; α < 0.05) has even shown therapeutic potential,
reversing cognitive decline via inhibition of neuronal apoptosis.“*9 However, in Figure 3a we
observe that COX-INHs as a class do not yield reduced risk of AD compared to the negative
control. That said, we find etodolac significantly reduces the risk of AD (HR = 0.78; α < 0.001)
compared to other COX-INHs, including flurbiprofen (HR = 0.95; non-significant) (Figure 3b).
This suggests that only certain COX-INHs, such as etodolac, may elicit protective effects against
AD onset. Importantly, this may be a result of differences in target, as etodolac targets COX2,
while flurbiprofen targets COX1. On the other hand, ACE-INHs were found to also protect
against AD onset in Figure 3a. Specifically, we evaluate only ACE-INHs that cross the blood
brain barrier (BBB) as previous insurance claims metadata analyses have indicated those that
do not cross the BBB have no effect on AD.⁵⁷ To see if any of the BBB crossing ACE-INHs have
a greater protective effect than others, we produce a KM plot for fosinopril, lisinopril, and
trandolapril (Figure 3c). Unlike for COX-INHs, ACE-INHs do not elicit any significant by-drug
difference in AD onset, as illustrated in Figure 3c. While MPI prioritized four BBB crossing
(BBBx) and four non-BBBx ACE-INHs in the top-50, the BBBx ACE-INHs had a lower average rank
compared to the non-BBBx ACE-INHs (15 and 19, respectively). 

Another important distinction between COX-INHs and ACE-INHs is that ACE-INHs have been
shown to have some ameliorative potential, whereas COX-INHs have only shown protective
effects. In fact, fosinopril and lisinopril (ranked 2nd and 24th by MPI, respectively) were found to
reduce cognitive decline in animal models of AD.⁴⁴,⁵⁸ In Figure 3a, we find that BBBx ACE-INHs
consistently exhibit decreased risk of AD relative to our negative control drug, bupropion.
Additionally, there does not appear to be a significant difference between any of the BBBx
ACE-INHs with respect to their protection against AD, indicating that they are possibly all
viable candidates for repurposing. This is in agreement with other published evidence that
has identified BBBx ACE-INHs as having a protective effect on AD development. Interestingly, MPI
prioritized 133.6% more COX-INHs and 700.9% more ACE-INHs than BSL in the top-50 from
all such drugs in our network. MPI’s ability to prioritize more drugs from MOAs with known
anti-AD activity suggests that it may be a more viable option when identifying candidates for
drug repurposing. 

MPI also highly prioritizes drugs that increase the brain’s response to acetylcholine, either
by reducing its degradation (acetylcholinesterase inhibitors, AChE-INHs) or by stimulating its
receptors (acetylcholine receptor agonists, AChR-Ags). This is important as the synaptic
bioavailability of acetylcholine (ACh) is an important contributor to AD progression. That is,
there is evidence that cholinergic atrophy and ACh deficiency are linked with cognitive decline
in AD patients.⁵⁹ Moreover, many of the current FDA-approved drugs indicated to slow AD
progression target this mechanism of disease progression via AChE-INHs (e.g., donepezil,
rivastigmine, and galantamine). AChR-Ags also enhance ACh signaling. Such drugs, including
nicotine, accomplish this by increasing the response of ACh receptors located on the
post-synaptic neuron. Interestingly, nicotine was found to significantly improve cognition in
patients with mild cognitive impairment, which is a precursor to AD.⁴⁵ We also find long-term
nicotine use to have a protective effect (HR = 0.532; α < 0.001) with respect to AD onset. In
Figure 3a, we observe that nicotine carries a risk of developing AD similar to that of ACE-INHs
after six to seven years. Conversely, we find evidence that long-term use of trihexyphenidyl,
which reduces the activity of ACh receptors, is associated with AD-like neurodegeneration in
rats.?4 This is corroborated by Figure 3a, where we observe the
highest risk of AD elicited by trihexyphenidyl. More than eight years on trihexyphenidyl was
associated with a substantial increase in the risk of AD relative to the other drugs evaluated in
Figure 3a. These findings confirm that ACh signaling is closely linked with AD progression. As
such, other drugs and drug classes which either increase ACh synaptic bioavailability
or enhance neuronal response to ACh should be further examined for anti-AD activity. 
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4. Discussion 


In this work, we propose a novel network-based, AD-specific drug repurposing approach called 
MPI. MPI improves upon prior network-based methods by leveraging node embeddings learned 
via DeepWalk to prioritize AD-associated paths. Moreover, the use of learned embeddings allows 
MPI to more effectively capture a network’s rich topology than previous approaches, such as 
BSL. In a direct comparison, we find that 20% more of MPI’s highly prioritized drug candidates 
(top-50) have published anti-AD evidence compared to BSL’s highly prioritized drug candi- 
dates. In addition to evidence in literature, we leverage insurance claims data to produce Cox 
proportional-hazard models. Among all the drugs we evaluate, these models identified BBBx 
ACE-INHs as having the lowest risk of AD. Similarly, etodolac was found to have the lowest risk
of AD among the four COX-INHs we evaluated (Figure 3b), indicating that this drug in particular
may have a protective effect despite the class as a whole not exhibiting a significantly reduced
risk of AD compared to our negative control (Figure 3a). Additionally, MPI highly prioritizes
drugs that target the cholinergic system. Each of the approved AD drugs in our dataset that
is also an AChE-INH is prioritized in the top-50 by MPI. MPI also highly prioritizes nicotine,
an AChR-Ag. This prioritization is supported by both literature and our Cox models, which
suggest nicotine is associated with reduced risk of AD. Altogether, the results presented in this
work highlight etodolac, nicotine, and ACE-INHs as viable candidates for repurposing to treat
AD that deserve further examination in future studies. 
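
DeepWalk, which MPI uses to learn node embeddings, works in two stages: it samples truncated random walks from the network and then trains a skip-gram model on those walks as if they were sentences. The sketch below illustrates only the walk-sampling stage on a hypothetical four-node toy graph. The node names, parameter values, and function names are assumptions for illustration and do not reproduce MPI's implementation or its path-prioritization step.

```python
import random


def random_walk(adjacency, start, walk_length, rng):
    """Sample one truncated random walk of at most walk_length nodes."""
    walk = [start]
    while len(walk) < walk_length:
        neighbors = adjacency[walk[-1]]
        if not neighbors:  # dead end: stop early
            break
        walk.append(rng.choice(neighbors))
    return walk


def generate_walks(adjacency, walks_per_node, walk_length, seed=0):
    """Start walks_per_node walks from every node, shuffling node order per pass."""
    rng = random.Random(seed)
    nodes = list(adjacency)
    walks = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)
        for node in nodes:
            walks.append(random_walk(adjacency, node, walk_length, rng))
    return walks


# Hypothetical undirected toy network: drug - protein - protein - disease.
toy_network = {
    "drug_A": ["protein_1"],
    "protein_1": ["drug_A", "protein_2"],
    "protein_2": ["protein_1", "AD"],
    "AD": ["protein_2"],
}

walks = generate_walks(toy_network, walks_per_node=2, walk_length=5)
print(len(walks), "walks; first walk:", walks[0])
```

In a full pipeline, the resulting walks would be passed to a skip-gram trainer to obtain one embedding vector per node; that stage is omitted here.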

Despite its promising results, MPI exhibits a few limitations. The PPI network we construct 
is a simplification of molecular pathways. Like many other network-based approaches, MPI does 
not consider loops nor the directionality of PPI as these can be difficult for models to learn. In 
our context, this means that highly ranked candidates are only likely to be in close proximity 
to AD-related genes. To improve drug prioritization, models must be capable of identifying 
drugs that are both upstream of and in close proximity to these AD-related genes. In future 
studies, we will leverage directed interactions either by hard coding them or learning them. 
One way directionality might be learned is through the use of multi-omics data. Examining 
how changes to genomic and epigenomic profiles affect gene expression could facilitate learning 
where genes are in pathways. Furthermore, by leveraging multi-omics data, we may be able 
to provide more personalized drug recommendations. 
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1. Overview 


Precision medicine and precision public health rely on the premise that determinants of disease 
incidence and differences in response to interventions can be identified, and their biology can be 
understood well enough for the development of individualized interventions that reduce the risk of 
disease and improve treatment. At the same time, well-documented racial and ethnic disparities exist 
throughout healthcare at the patient, provider, and healthcare system levels. These disparities are 
driven by a complex interplay among social, psychosocial, lifestyle, environmental, health system, 
and biological determinants of health (Freedman et al. 2021). The aim of the PSB 2024 session
“Overcoming health disparities in precision medicine” is to elicit the development of new methods
and concepts that can be used to uncover undetected biases and to develop effective therapies and
fair AI that improve precision healthcare, help reduce these disparities, and ultimately improve
health equity. 


2. Dealing with the lack of diversity in current research datasets 


An overwhelming focus on individuals of European descent in past genomic studies, which account 
for 86% of all such research, has created inequities in precision medical insights and has limited 
scientific discovery (Fatumo et al. 2022). It is imperative to diversify genomic research data and to 
make investments aimed at understanding and eliminating these health inequalities. 

In the meantime, methods that can use the currently available genomic and clinical data, which 
admittedly are lacking in diversity, to provide equitable prediction of phenotypes are needed. The 
paper by Comajoan Cara et al. (2024) in these proceedings introduces PopGenAdapt, a model that 
tackles the lack of diversity in genomic datasets by using semi-supervised domain adaptation 
techniques. The model effectively leverages labeled data from individuals of European ancestry and 
both labeled and unlabeled data from underrepresented populations. When tested in populations 
from Nigeria, Sri Lanka, and Hawaii, PopGenAdapt showed significant improvement in predicting 
disease outcomes compared to existing methods, highlighting its potential for more inclusive 
biomedical research. 

On the other hand, the paper by Bonet et al. (2024) introduces a machine learning toolkit
designed to directly improve the accuracy of genomic-based medical predictions for
underrepresented populations. By employing gradient boosting, ensembling, and
population-conditional re-sampling to address the lack of diversity, the method enhances
phenotype prediction accuracy, achieving results comparable to those for well-represented
European populations.

© 2023 The Authors. Open Access chapter published by World Scientific Publishing Company and 

Ancestry can impact gene expression prediction methods as potentially undiscovered population 
variants and eQTNs affecting gene expression may exist in understudied populations. The paper 
by Mishra et al. introduces LA-GEM, a gene imputation model that incorporates local ancestry (LA) 
to improve gene expression predictions in African American populations. Tested on a cohort of 60 
African American hepatocyte primary cultures, LA-GEM outperformed existing models like 
PrediXcan by reliably predicting the expression of unique genes critical to drug metabolism in this 
sample. The study highlights the value of leveraging local ancestry in gene imputation models for 
admixed populations to better understand disease susceptibility and drug response in all populations. 


3. Development of fair machine learning algorithms 


The development of fair algorithms and machine learning in healthcare is crucial for reducing health 
disparities, improving diagnostic accuracy, and building public trust. By minimizing biases, 
equitable healthcare and regulatory compliance are promoted, leading to more economically efficient 
systems. 

One of the first steps to algorithmic equity is the thorough exploration of the input data to be 
used in their training for inherent and occult biases. The paper by Orlenko et al. (2024) provides 
examples of such necessary data exploration using cluster analysis to identify two distinct subgroups 
of elective spinal fusion patients based on insurance type. These findings reveal significant 
differences in characteristics and post-surgery outcomes related to socioeconomic and racial 
disparities. The aim is to inform the design of machine learning models to ensure fairness and 
minimize bias in healthcare predictions. 

Methods designed to provide fair algorithmic predictions from the ground up are needed as well. 
Jun et al. (2024) use a fairness algorithm, Fairness-Aware Causal paThs (FACTS), to analyze nine 
years of electronic health records and social determinants of health to quantify disparities in MRSA 
infection outcomes. The study identified moderate disparities in age, gender, race, and income, 
revealing that comorbidities played a role in these disparities. Factors like kidney impairment and 
drug use affected racial disparity, while income and healthcare access affected gender disparity. The 
findings highlight the need for policies that address both clinical factors and social determinants to 
mitigate health disparities. 


4. Race, genetic ancestry, and population structure 


The persistent use of "race" and "ethnicity" in precision medicine, classifications rooted in perceived 
physical characteristics and cultural backgrounds, has generated substantial discussion (Nat. Acad. 
Sci. Eng. Med., 2023). However, for the purpose of examining health equity, these categories remain 
essential for identifying and addressing systemic disparities (Kahu et al. 2021). Established by the 
U.S. Office of Management and Budget in 1995, these classifications are integral for organizing 
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data on social determinants of health and population demographics. They guide resource allocation, 
policymaking, and the development of culturally sensitive healthcare interventions. 

The paper by Rhead et al. (2024) tackles the problem of missing disaggregated race and ethnicity 
data in real-world databases by introducing methods for imputing these categories using genetic 
ancestry from available genetic data. Analyzing data from over 100,000 cancer patients, ancestry- 
based machine learning methods were shown to outperform existing race imputation algorithms 
based on geolocation and surnames commonly used in administrative health data. The research 
offers a new way to improve real-world healthcare data for studying and ensuring healthcare equity 
and to enable its use in the development of diversity plans for clinical trials soon to be required per 
FDA guidance. 

On the other hand, the study by Seagle et al. (2024) analyzes the genetic ancestry of 35,842 
individuals over 100 birth years in the Southeastern United States, finding increasing levels of 
genetic admixture and heterozygosity in younger populations since 1990. This rise in diversity poses 
challenges to traditional genotype-phenotype relationship studies. The researchers explore the 
impact of increased admixture on health outcomes, discovering that greater genetic diversity was 
associated with protective effects against female reproductive disorders but elevated risks for 
diseases linked to autoimmune dysfunction. This highlights the influence of ancestral complexity 
on health disparities. 

The social construct of race and ethnicity is far from precise, serving as a poor proxy for 
ancestry. In this vein, the study by Piekos et al. (2024) employs genetic ancestry rather than race to 
assess disease risk factors, leveraging data from the BioVU biobank. Researchers estimated six 
ancestry proportions and performed phenome-wide association studies, finding varying risks for 
conditions like 'Neoplasms' and 'Pregnancy Complications' based on different ancestries. The study 
also found that linear modeling was sufficient for assessing hypertension and atrial fibrillation risk 
in relation to ancestry, but not for renal failure, indicating the need for more complex models in 
certain cases. 


5. Conclusion 


The increased attention to social justice has emphasized the urgent need to tackle health disparities 
more effectively. Advanced computational and statistical approaches are essential for assessing and 
mitigating these disparities in healthcare. Their adoption is not just a technological advancement but 
also an ethical necessity for creating a healthcare environment that serves all communities 
effectively. We believe that the new methods in the collection of research papers accepted to the 
PSB 2024 proceedings can contribute to overcoming disparities in precision medicine. 
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The lack of diversity in genomic datasets, currently skewed towards individuals of Euro- 
pean ancestry, presents a challenge in developing inclusive biomedical models. The scarcity 
of such data is particularly evident in labeled datasets that include genomic data linked 
to electronic health records. To address this gap, this paper presents PopGenAdapt, a 
genotype-to-phenotype prediction model which adopts semi-supervised domain adaptation 
(SSDA) techniques originally proposed for computer vision. PopGenAdapt is designed to 
leverage the substantial labeled data available from individuals of European ancestry, as well 
as the limited labeled and the larger amount of unlabeled data from currently underrepre- 
sented populations. The method is evaluated in underrepresented populations from Nigeria, 
Sri Lanka, and Hawaii for the prediction of several disease outcomes. The results suggest 
a significant improvement in the performance of genotype-to-phenotype models for these 
populations over state-of-the-art supervised learning methods, setting SSDA as a promising 
strategy for creating more inclusive machine learning models in biomedical research. 
Our code is available at https://github.com/AI-sandbox/PopGenAdapt. 


Keywords: phenotype prediction, semi-supervised, domain adaptation, underrepresented 
population 


1. Introduction 


Genomic data has become increasingly important for biomedical research, as it can reveal 
insights into the causes, diagnosis, prevention, and treatment of various diseases. However, the 
available data is predominantly from individuals of European ancestry, despite their making 
up only 16% of the global population. This disproportionate representation presents one of the 
major challenges in developing biomedical models and studies that can effectively generalize 
across diverse populations, posing the risk of exacerbating existing health disparities.¹ While 
widely adopted datasets such as the UK Biobank² provide rich phenotypic information from 
electronic health records, they lack diversity (see Fig. 1). On the other hand, highly diverse 
datasets, such as gnomAD,³ lack phenotypic data, which makes them not directly usable to 
train supervised genotype-to-phenotype machine learning models, as phenotype labels for all 
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the samples are required. New algorithmic solutions are needed in order to profit from all 
available data. 


[Figure 1: horizontal bar chart; x-axis: Count (0 to 350,000); bars for White British,
Non-British White, South Asian, East Asian, and African.]


Fig. 1. Broad population counts in the UK Biobank.² Genetically inferred population groups from 
the Global Biobank Engine.⁴ 


In this work, we propose PopGenAdapt, a semi-supervised domain adaptation (SSDA) 
method that can also exploit the available unlabeled data from underrepresented populations 
to improve the performance of phenotype prediction models. On the one hand, the semi- 
supervised nature of the proposed method makes possible the use of unlabeled data from 
underrepresented populations, as well as labeled data from large biobanks. On the other 
hand, the use of domain adaptation techniques makes it possible to still take advantage of 
the vast amount of data from individuals of European ancestry (the source domain), but to 
adapt the model predictions for a particular underrepresented population (the target domain). 
While SSDA has been previously applied to other types of data such as image and text, its 
application in genetics remains largely unexplored. 

We adapt methods proposed for SSDA in computer vision for genotype-to-phenotype pre- 
diction and evaluate them in underrepresented population groups from Nigeria, Sri Lanka, 
and Hawaii. Our results predicting phenotypes including hypertension, diabetes, myxoedema, 
and asthma, demonstrate that SSDA can significantly enhance the performance of genotype- 
to-phenotype models in underrepresented populations, suggesting a promising direction for 
developing better machine learning models for diverse populations. 


2. Background 
2.1. Genotype-to-Phenotype Prediction 


DNA is the hereditary material in humans and all living organisms, contributing to essential 
functions and appearance. While most positions in the DNA sequence are identical between 
individuals of the same species, some vary. Out of more than 3 billion positions, a typical 
human genome differs from the reference genetic sequence at 4 to 5 million sites (~0.15%).⁵ 
In total, more than 600 million variable positions have been identified across different hu- 
mans. These variable positions are called single nucleotide polymorphisms (SNPs) and can 
be encoded as a ternary sequence, representing the counts of non-reference variants at each 
position, with 0 indicating that both maternal and paternal positions match the reference 
genome, 1 indicating that only maternal or paternal positions match, and 2 indicating that 
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both are alternative variants. 

Phenotypes are the observable characteristics of an organism that result from the inter- 
action between its genotype (the genetic makeup determined by its DNA sequence) and the 
environment. These characteristics comprise physical and behavioral traits, as well as risk of 
developing certain diseases. Both the frequency distribution of genomic variants, and as a 
result, the distribution of phenotypes, vary across different populations. As a consequence, 
most studies developed for a particular population do not generalize well to other population 
groups.¹ 
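
The ternary SNP encoding described above can be illustrated with a short Python sketch. It assumes biallelic sites and phased maternal and paternal sequences given as plain strings; the function name and the four-position example sequences are hypothetical and chosen only to show each of the three codes.

```python
def encode_genotype(maternal, paternal, reference):
    """Count non-reference alleles (0, 1, or 2) at each position."""
    if not (len(maternal) == len(paternal) == len(reference)):
        raise ValueError("all sequences must have the same length")
    return [
        int(m != r) + int(p != r)
        for m, p, r in zip(maternal, paternal, reference)
    ]


# Hypothetical four-position example.
reference = "ACGT"
maternal = "ACTT"  # differs from the reference at position 2
paternal = "AGTT"  # differs from the reference at positions 1 and 2

encoded = encode_genotype(maternal, paternal, reference)
print(encoded)  # [0, 1, 2, 0]
```

The resulting integer vector is the kind of input a genotype-to-phenotype model consumes, with one feature per SNP.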

The goal of genotype-to-phenotype prediction is to use the genetic variation (SNP se- 
quences) to estimate the phenotypes of an individual. Multiple machine learning models have 
been applied to solve this task, either using general-purpose methods like logistic regression, 
gradient boosting machines, or neural networks,”* or through linear models specifically
tailored to genetic data, such as PRS-CS,⁹ SBayesR,¹⁰ or snpnet.¹¹ 


2.2. Semi-Supervised Domain Adaptation 


Supervised learning is the framework most often adopted to train predictive models by using 
input samples and label pairs. However, in many real-world scenarios, such as in biomedical 
applications, obtaining labeled data can be challenging, involving time-consuming and expen- 
sive collection procedures. This limitation suggests the application of semi-supervised learning 
techniques, which can leverage both labeled and unlabeled data for training, providing better 
generalization than traditional supervised learning approaches.¹² 

Both supervised and semi-supervised methods assume that the distribution of the train- 
ing data (source domain) is the same as the one found during real-world deployment (target 
domain). However, this is not always the case, leading to distribution shifts that can drasti- 
cally decrease the predictive performance. In order to address this shift, domain adaptation 
techniques have been proposed to properly adjust the models to bridge the gap between dis- 
tributions and achieve accurate predictions in both the source and target domains. 


[Figure 2: scatter-plot panels (SSL, SDA, SSDA) with a legend distinguishing labeled and
unlabeled points from the source and target domains.]


Fig. 2. Illustration of supervised learning (SL), semi-supervised learning (SSL), supervised domain 
adaptation (SDA), and semi-supervised domain adaptation (SSDA) in the case of binary
classification. Circle and cross markers represent negative and positive classes, opaque and
transparent markers represent labeled and unlabeled points, and blue and orange markers
represent source and target domains, respectively. 


Semi-supervised domain adaptation (SSDA) combines both semi-supervised learning and 
domain adaptation paradigms. The goal of SSDA is to leverage labeled data from a source
domain, unlabeled data from the target domain, and a limited set of labeled data from the
target domain, in order to obtain a machine learning model that achieves good performance 
within both domains. 

In this paper, we adapt for genotype-to-phenotype prediction the state-of-the-art method
of SSDA via Minimax Entropy (MME, Ref. 13) with Source Label Adaptation (SLA, Ref. 14), which was
originally proposed in computer vision, considering different image domains, like photos,
drawings, or paintings. Here, instead, we will consider different domains to be different populations.


2.2.1. Minimax Entropy 


Fig. 3. Overview of the model architecture and minimax entropy proposed in Ref. 13.


Minimax Entropy (MME, Ref. 13, Fig. 3) proposes to use a neural network model consisting
of a feature extractor F and a classifier C. At the output of F, $\ell_2$ normalization and
temperature scaling are applied, inspired by Ref. 15. In the original work, F is a pre-trained
ResNet34 (Ref. 16), an image classification network, and C is a single layer which takes
$\frac{F(x_i)}{\|F(x_i)\|}$ as input and outputs
$g(x_i) = \sigma\left(\frac{1}{T}\frac{W^\top F(x_i)}{\|F(x_i)\|}\right)$, where $\sigma$ is the
softmax function and T is the temperature. The weight vectors $W = [w_1, \ldots, w_K]$ can be regarded
as a representative point of each class k, or “prototype”.
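
The classifier described above can be illustrated with a minimal sketch. It assumes plain Python lists in place of tensors, and the names `l2_normalize`, `softmax`, and `classify` are hypothetical helpers introduced here for illustration; they are not taken from Ref. 13. The default temperature of 0.05 is the value reported later for PopGenAdapt.

```python
import math

def l2_normalize(vec):
    """Scale a feature vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def softmax(logits):
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(features, prototypes, temperature=0.05):
    """Class probabilities from the similarity of F(x) to each prototype w_k."""
    unit = l2_normalize(features)
    logits = [sum(w * f for w, f in zip(w_k, unit)) / temperature
              for w_k in prototypes]
    return softmax(logits)

# Illustrative two-class example: one weight vector (prototype) per class.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
probs = classify([3.0, 1.0], prototypes)
```

Because the features are normalized before the product with W, rescaling F(x) leaves the prediction unchanged; only the direction of the feature vector relative to each prototype matters.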

Both C and F are trained to classify labeled examples correctly by minimizing the
cross-entropy loss $L_{CE}$ on the labeled data, from both the source and target domains. However, to
avoid overfitting on the source domain, which contains a larger number of samples, as well
as to take advantage of the unlabeled target data, it has been proposed to use an adversarial
regularization term, the Minimax Entropy. MME is formulated as adversarial training between
F and C, in which F is trained to minimize the conditional entropy H of the neural network
predictions from unlabeled target data $p(x)$, whereas C is trained to maximize the entropy of
the predictions $p(x_i)$. This adversarial learning forces F to learn discriminative features, while
C estimates domain-invariant prototypes, reducing the overfitting to the source domain. The
overall adversarial learning objective functions are:


$$\hat{\theta}_F = \arg\min_{\theta_F} \; L_{CE} + \lambda H \qquad (1)$$

$$\hat{\theta}_C = \arg\min_{\theta_C} \; L_{CE} - \lambda H \qquad (2)$$


where $\lambda$ is a hyperparameter to control the tradeoff between classification on labeled examples
and the minimax entropy training. To simplify the training process, MME makes use of a
gradient reversal layer (Ref. 17) to flip the gradient between C and F with respect to H, which allows
the minimax training to be performed with a single forward and backward pass.
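
To make the two objectives concrete, the sketch below computes the mean entropy of a batch of predictions and the resulting values of Eqs. 1 and 2. It is only an illustration under simplifying assumptions: probabilities are plain Python lists, no gradients are computed, and the function names `mean_entropy` and `mme_objectives` are hypothetical.

```python
import math

def mean_entropy(prob_rows):
    """Average Shannon entropy H of the predicted class distributions."""
    total = 0.0
    for probs in prob_rows:
        total -= sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(prob_rows)

def mme_objectives(loss_ce, entropy, lam):
    """Return (objective for F, objective for C) following Eqs. 1 and 2."""
    return loss_ce + lam * entropy, loss_ce - lam * entropy

confident = [[0.99, 0.01], [0.02, 0.98]]   # low entropy: what F is pushed towards
uncertain = [[0.5, 0.5], [0.5, 0.5]]       # high entropy: what C is pushed towards
obj_f, obj_c = mme_objectives(loss_ce=0.3, entropy=mean_entropy(uncertain), lam=0.1)
```

The two objectives differ only in the sign of the entropy term, which is why a gradient reversal layer, which negates the gradient of H flowing from C into F, suffices to train both parts in one pass.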


2.2.2. Source Label Adaptation 


Source Label Adaptation (SLA, Ref. 14) is a framework that considers source data as a noisily
labeled version of the target data and gradually adapts the source labels to the target space.
Specifically, inspired by Refs. 18,19, for each source point $x_i^s$, one constructs a modified source
label $\tilde{y}_i^s$ by combining, with a tradeoff ratio $\alpha$, the original source label $y_i^s$ and the prediction
of a source label adaptation model p:

$$\tilde{y}_i^s = (1 - \alpha)\, y_i^s + \alpha\, p(x_i^s) \qquad (3)$$

Note that p cannot be the current unadapted model g, as it would overfit to the source
data due to the larger number of samples, resulting in almost no effect. Thus, it has been
proposed to train p on the target domain data. However, to avoid simple memorization of the
target data due to the low number of labeled samples available, it has been proposed to use
a prototypical network (protonet), a model for few-shot learning. Given a feature extractor
F, the prototype of class k is defined as the center of features with the same class:


$$c_k = \frac{1}{N_k} \sum_{i=1}^{N} \mathbb{1}\{y_i = k\}\, F(x_i) \qquad (4)$$


Then, a protonet produces a distribution over classes for a query point $x_i$ based on a
softmax with temperature $\tau$ over the Euclidean distances to the prototypes in the embedding
space:

$$p(x_i)_k = \frac{\exp(-d(F(x_i), c_k)\,\tau)}{\sum_{k'} \exp(-d(F(x_i), c_{k'})\,\tau)} \qquad (5)$$

Moreover, in Ref. 14 it is proposed to derive the prototypes using the unlabeled data
available by using, for each unlabeled target instance $x_i^u$, pseudo labels $\hat{y}_i^u$ computed by the
current model g:

$$\hat{y}_i^u = \arg\max_k \; g(x_i^u)_k \qquad (6)$$


Using these pseudo labels, we can get pseudo centers by Eq. 4, and further define with them a
Protonet with Pseudo Centers (PPC) by Eq. 5. Next, the PPC is applied to Eq. 3 to compute
the modified source labels $\tilde{y}_i^s$ for each source instance $x_i^s$. Finally, the real source labels $y_i^s$
are replaced by the cleaned source labels $\tilde{y}_i^s$ in the computation of the cross-entropy for the
labeled source part of the whole dataset. The loss for labeled target data can still be a standard
cross-entropy loss. Other loss terms can still be included, like the minimax entropy proposed
in Ref. 13.
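
The label adaptation steps described in this subsection (pseudo labels, pseudo centers, the protonet softmax, and the label mixing) can be sketched as follows. This is an illustrative toy under stated assumptions, not the implementation of Ref. 14: features are short Python lists rather than network embeddings, the pseudo labels are given directly instead of being produced by a trained model g, and the helper names are hypothetical.

```python
import math

def pseudo_centers(features, pseudo_labels, num_classes):
    """Eq. 4: mean feature vector of the points assigned to each class."""
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for feat, label in zip(features, pseudo_labels):
        counts[label] += 1
        for j, value in enumerate(feat):
            sums[label][j] += value
    # Assumption: every class receives at least one pseudo label.
    return [[s / counts[k] for s in sums[k]] for k in range(num_classes)]

def protonet_probs(feature, centers, tau):
    """Eq. 5: softmax over negative Euclidean distances scaled by tau."""
    scores = [-math.dist(feature, c) * tau for c in centers]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def adapt_label(one_hot, ppc_probs, alpha):
    """Eq. 3: convex combination of the source label and the PPC output."""
    return [(1 - alpha) * y + alpha * p for y, p in zip(one_hot, ppc_probs)]

# Toy unlabeled target features with pseudo labels (Eq. 6 would supply these).
target_feats = [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.2, 1.0]]
target_pseudo = [0, 0, 1, 1]
centers = pseudo_centers(target_feats, target_pseudo, num_classes=2)

# A source point labeled as class 0 whose features sit close to the class 1 center.
source_feat, source_label = [0.9, 1.0], [1.0, 0.0]
ppc = protonet_probs(source_feat, centers, tau=5.0)
new_label = adapt_label(source_label, ppc, alpha=0.3)
```

In this toy case the modified label keeps most of its mass on the original class but shifts roughly a fraction α towards the class suggested by the target prototypes, which is the intended "cleaning" effect.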

In practice, the SLA framework is only applied after W warmup steps in which the model 
is trained normally with the original source labels to obtain an initial robust model, and then 
the pseudo labels are only recomputed every I steps for efficiency. Since the SLA paradigm 
of considering the source labels as noisy from the target domain viewpoint and cleaning them 
is orthogonal to the ideas in MME, ?!? both approaches can be combined to get superior results. 
We refer to this combination as MME-SLA. 


2.3. Semi-Supervised Learning and Domain Adaptation for 
Genotype-to-Phenotype Prediction 


Semi-supervised learning techniques have been previously applied in genotype-to-phenotype 
prediction. For example, Ref. 21 proposed a method to predict the residual feed intake in 
dairy cattle using both labeled and unlabeled samples. However, the samples are assumed to 
be from the same domain, so the method would still have the problem of not generalizing to 
other populations. 

Likewise, domain adaptation techniques have also been applied in genotype-to-phenotype 
prediction. For instance, Refs. 22-24 proposed several transfer learning techniques to also
improve prediction performance for underrepresented populations. However, the proposed ap- 
proaches cannot utilize unlabeled samples, thereby still grappling with the scarcity of labeled 
data from underrepresented populations. Consequently, the achieved performance improve- 
ment remains limited. To our knowledge, this is the first work to combine both approaches 
by applying semi-supervised domain adaptation for genotype-to-phenotype prediction. 


3. Method 
3.1. Data 


We apply these methods to predict multiple disease outcomes, including hypertension, dia- 
betes, myxoedema, and asthma, for individuals from populations underrepresented in com- 
monly used datasets, including Nigeria, Sri Lanka, and Hawaii, available in the UK Biobank? 
and the PAGE study.”° In order to have meaningful results, we limit the phenotypes to these 
four, as they have a high enough case count within the three target populations. For each 
phenotype, we use balanced data from white British individuals as the source domain, ob- 
tained by removing samples from the majority class. Then, for each phenotype and target 
population, we use the labeled source domain data as well as labeled and unlabeled data from 
the target domain. To test the method’s efficacy, we use a subset of labeled data exclusive to 
the target underrepresented population. 

To establish which samples constitute the target population domain, we propose two ap- 
proaches, which show two different ways in which we can take advantage of the availability of 
datasets, even when the labeled data in the target domain is very scarce. The first approach 
is adopted for the Nigerian and Sri Lankan populations, and the second one for the Hawaiian 
population.

Fig. 4. Two-dimensional PCA projection of the samples in the UK Biobank and the Hawaiian
dataset. The PCA was fitted with only the samples from the UK Biobank. Note that all samples 
marked as Sri Lankan fall within the South Asian genetic ancestry cluster, and all the Nigerian ones 
fall within the African cluster. 


The first approach only uses data from the UK Biobank.? To establish which samples 
constitute the target population, we combine the genetically inferred ancestry available from 
the Global Biobank Engine* (white British, non-British white, South Asian, East Asian, or 
African) and the country of birth reported in the UK Biobank.? We use both fields because the 
inferred genetic ancestry provides a continental-level description, encompassing many regions 
within each label. On the other hand, the country of birth alone is not representative of the 
ancestry composition within the UK Biobank due to high selection bias, as the samples were 
collected in assessment centers in the United Kingdom, so many individuals in the data born 
outside the United Kingdom are still of English genetic ancestry. By filtering both by inferred 
population group and country of birth, we ensure that the definition of the target domain is 
precise. 

In particular, for the case of Nigeria, we only keep the samples that are of African genetic 
ancestry and born in Nigeria, and for the case of Sri Lanka, the samples that are of South 
Asian genetic ancestry and born in Sri Lanka. This results in a total of 852 samples for the 
Nigeria group and 535 samples for the Sri Lanka group. Once we have the samples from the 
target domain, since the UK Biobank has phenotype labels for all the samples, we artificially 
unlabel half of them for the purpose of evaluating the proposed method. For training, we use 
all the unlabeled samples plus only 10 labeled samples from the target domain, 5 negative and 
5 positive, alongside all the labeled samples from the source domain. The rest of the labeled 
individuals of the target domain are split into two equal parts using stratified sampling to 
create the validation and test sets. Note that we can only use labeled data for validation and 
testing. 
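
The splitting procedure for the Nigerian and Sri Lankan target domains can be sketched as below. This is a simplified illustration: labels are assumed to be binary (0 or 1), the random seed and the function name `split_target` are hypothetical, and the choice of which half to unlabel is made by a plain shuffle, which the text does not specify.

```python
import random

def split_target(labels, n_per_class=5, seed=0):
    """Return index lists for the unlabeled, train, validation, and test sets."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)
    half = len(indices) // 2
    # The labels of the first half are hidden; the second half stays labeled.
    unlabeled, labeled = indices[:half], indices[half:]

    train, val, test = [], [], []
    for cls in (0, 1):
        members = [i for i in labeled if labels[i] == cls]
        train += members[:n_per_class]      # 5 negative and 5 positive samples
        rest = members[n_per_class:]
        mid = len(rest) // 2                # stratified: halve each class separately
        val += rest[:mid]
        test += rest[mid:]
    return {"unlabeled": unlabeled, "train": train, "val": val, "test": test}

toy_labels = [0] * 60 + [1] * 40
parts = split_target(toy_labels)
```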

The second approach to define the target domain shows how additional unlabeled datasets
can be employed. To achieve this, in addition to the UK Biobank,? we use a dataset of SNP 
sequences (without phenotype labels) of 5,862 Native Hawaiian individuals from the PAGE 
study.” In this setting, we only have unlabeled data from the target population. Note that we 
cannot use the country of birth field, as the people born in Hawaii are labeled as born in the 
USA. To have labeled data in the target domain, we propose to use the nearest neighbor of 
each sample from the Hawaiian dataset within the UK Biobank, excluding the white British 
individuals to avoid having repeated samples in both the source and target domains. For 
efficiency, we compute the distances between samples on the first 50 principal components, 
instead of using the raw SNP sequences. After removing duplicated individuals that are the 
nearest neighbor to more than one sample from the Hawaiian dataset, we obtained 1,689 
labeled samples. While it is unlikely that the UK Biobank contains this many individuals of 
Hawaiian ancestry, the closer distribution of these samples to the Native Hawaiian population 
makes the domain more apt to model them than using samples of predominantly European 
ancestry. 
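
The nearest-neighbor selection can be sketched as follows. The sketch assumes Euclidean distance in principal-component space (the text only states that distances are computed on the first 50 components) and uses toy two-dimensional points; `nearest_labeled_samples` is a hypothetical name.

```python
import math

def nearest_labeled_samples(query_pcs, reference_pcs):
    """For each query, find its closest reference sample; return unique indices."""
    chosen = set()
    for q in query_pcs:
        best = min(range(len(reference_pcs)),
                   key=lambda j: math.dist(q, reference_pcs[j]))
        chosen.add(best)   # a set drops references selected by several queries
    return sorted(chosen)

# Toy 2-D stand-ins for the 50 principal components used in the paper.
hawaiian = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.0]]
# Reference candidates: labeled biobank samples, with the source population excluded.
biobank = [[0.0, 0.0], [4.8, 5.1], [9.0, 9.0], [-3.0, 2.0]]
selected = nearest_labeled_samples(hawaiian, biobank)
```

Here the first two queries share the same nearest neighbor, so three queries yield only two labeled samples, mirroring the de-duplication step described above.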

The second approach to defining the target domain is less accurate than the first one, as it 
includes samples from other similar populations. However, it has the advantage that it results 
in a larger number of samples, which can be helpful for unbalanced phenotypes with a low 
positive case count, and to counteract the effect of having a noisier target domain definition. 
In this scenario, we use 50% of the labeled target samples for the training set, 25% for the 
validation set, and 25% for the test set. Note that the unlabeled samples used for training are 
the ones from the Hawaiian dataset from the PAGE study.?° 


Table 1. Size of sets used for training and evaluation for each population. Note that
a combination of white British as the source domain plus another population as the
target domain is always used.

Population      Training labeled   Training unlabeled   Val. + Test labeled   Total
White British   *                  0                    0                     *
Nigeria         10                 213                  106 + 107             852
Sri Lanka       10                 134                  67 + 67               535
Hawaii          822                5,862                412 + 413             7,507

*White British set size depends on class balancing, but in all cases is >40,000.


We use the variants that are both in the UK Biobank data and the Hawaiian dataset, 
resulting in 83,362 overlapping SNPs. Note that we decided to not impute the SNPs outside 
the intersection to avoid introducing a bias. Most algorithms to perform statistical imputation 
are based on the available samples, and since in this scenario most of them are not from the 
underrepresented population, the imputation could result in incorrect values.

Table 2. Case counts for each disease and population (only considering labeled samples).

Population       Hypertension       Myxoedema         Diabetes          Asthma
White British*   114,687 (50.00%)   21,471 (50.00%)   23,099 (50.00%)   45,192 (50.00%)
Nigeria          105 (47.08%)       11 (4.93%)        38 (17.04%)       21 (9.41%)
Sri Lanka        60 (41.66%)        18 (12.50%)       41 (28.47%)       29 (20.13%)
Hawaii           535 (31.68%)       69 (4.09%)        153 (9.05%)       197 (11.66%)

*White British phenotypes are balanced by undersampling the negative class.


3.2. Model 


We adapt the MME-SLA method (Refs. 13,14), originally proposed for classification tasks in computer
vision, to genotype-to-phenotype prediction by replacing the ResNet34 (Ref. 16) backbone model
used in the original works with a multi-layer perceptron (MLP). Specifically, we use a 4-layer
MLP with GELU activations,” layer normalization,” and a residual connection (Ref. 16) between the
output of the first layer and the input of the last one. The choice of activation and the use of
layer normalization and a residual connection are commonly adopted in modern architectures
such as Transformers?’ and have been shown to help improve the performance of the models,
as well as their stability during training. The initial layer of the network takes an input size
corresponding to the number of SNPs and reduces it to a hidden size of 256. Subsequently,
the two middle layers maintain the same input and output dimensions of 256. Next, before
the last layer, $\ell_2$ normalization and temperature scaling with T = 0.05 are applied, as proposed
in Refs. 13,15. Finally, the last layer, which acts as the classifier, produces an output size
equivalent to the number of classes. We call the complete model PopGenAdapt.
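
A dependency-free sketch of one forward pass through such a backbone is given below. Several details are assumptions made only for illustration, since the text does not fix them: the order of layer normalization and GELU within each hidden layer, the absence of bias terms, the random initialization, the 0/1/2 genotype encoding, and the toy layer sizes (the paper uses the number of SNPs as input size and a hidden size of 256).

```python
import math
import random

def linear(weights, x):
    """Matrix-vector product; bias terms are omitted for brevity."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def gelu(x):
    return [0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x]

def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def init(n_out, n_in, rng):
    scale = 1.0 / math.sqrt(n_in)
    return [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]

def forward(x, params, temperature=0.05):
    w1, w2, w3, w4 = params
    h1 = gelu(layer_norm(linear(w1, x)))       # layer 1: SNPs -> hidden
    h = gelu(layer_norm(linear(w2, h1)))       # layer 2: hidden -> hidden
    h = gelu(layer_norm(linear(w3, h)))        # layer 3: hidden -> hidden
    h = [a + b for a, b in zip(h, h1)]         # residual from layer 1 output
    norm = math.sqrt(sum(v * v for v in h)) or 1.0
    h = [v / (norm * temperature) for v in h]  # L2 normalization and temperature
    return softmax(linear(w4, h))              # layer 4: classifier

rng = random.Random(0)
n_snps, n_hidden, n_classes = 20, 8, 2         # toy sizes for illustration only
params = [init(n_hidden, n_snps, rng), init(n_hidden, n_hidden, rng),
          init(n_hidden, n_hidden, rng), init(n_classes, n_hidden, rng)]
genotype = [float(rng.choice([0, 1, 2])) for _ in range(n_snps)]
probs = forward(genotype, params)
```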


Fig. 5. Diagram of the backbone MLP model for PopGenAdapt.


The backbone MLP model without the MME-SLA components is also used as the baseline
model, to measure how much applying SSDA improves over a typical supervised learning
approach.

We train the baseline and PopGenAdapt models for each combination of target population 
and phenotype using a batch size of 64, the AdamW optimizer?? with weight decay of 0.01, and 
the same learning rate scheduler used in Ref. 14. We use randomized hyperparameter search 
to tune several hyperparameters. For the baseline method, we only tune the learning rate. For 
PopGenAdapt, we tune both the learning rate and the MME-SLA (Refs. 13,14) hyperparameters ($\lambda$, $\alpha$,
$\tau$, W, and I). Table 3 shows the hyperparameter space from which the values were sampled.
Our experiments showed that the resulting performance is highly sensitive to the choice of
hyperparameters, as also pointed out in the paper which introduced SLA (Ref. 14). We select the
model with the best validation AUROC on the target domain and perform the final testing 
on a separate hold-out test set also using AUROC on the target domain. 


Table 3. Definition of the distribution of the hyperparameter space.

Hyperparameter            Probability distribution
Learning rate             LogUniform(10~°, 10^-2)
MME tradeoff λ            Uniform(0, 1)
SLA mix ratio α           Uniform(0, 1)
SLA temperature τ         Uniform(0, 1)
SLA warmup W              UniformChoice({100, 500, 1000, 2000, 5000})
SLA update interval I     UniformChoice({5, 10, 100, 500, 1000, 2000, 5000})
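
Drawing a configuration from these distributions in a randomized search can be sketched as below. The function name `sample_config` and the dictionary keys are hypothetical, and the lower learning-rate bound is left as a required argument; the value `1e-5` used in the example call is an assumed placeholder, not a value taken from the paper.

```python
import math
import random

WARMUP_CHOICES = [100, 500, 1000, 2000, 5000]
INTERVAL_CHOICES = [5, 10, 100, 500, 1000, 2000, 5000]

def sample_config(rng, lr_low, lr_high=1e-2):
    """Draw one PopGenAdapt configuration from the distributions in Table 3."""
    log_lr = rng.uniform(math.log(lr_low), math.log(lr_high))
    return {
        "learning_rate": math.exp(log_lr),         # LogUniform(lr_low, lr_high)
        "lambda": rng.uniform(0.0, 1.0),           # MME tradeoff
        "alpha": rng.uniform(0.0, 1.0),            # SLA mix ratio
        "tau": rng.uniform(0.0, 1.0),              # SLA temperature
        "warmup": rng.choice(WARMUP_CHOICES),      # SLA warmup W
        "interval": rng.choice(INTERVAL_CHOICES),  # SLA update interval I
    }

rng = random.Random(42)
# lr_low=1e-5 is an assumed placeholder for the lower learning-rate bound.
configs = [sample_config(rng, lr_low=1e-5) for _ in range(20)]
```

Sampling the learning rate uniformly in log space gives each order of magnitude the same chance of being explored, which is the usual motivation for a log-uniform prior.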


Note that while PopGenAdapt employs both the labeled and unlabeled samples, the baselines
are trained only on the labeled subset, as they have no way of using the unlabeled samples.

The training and inference were performed with an NVIDIA GeForce GTX 1080 Ti GPU
(11 GB), and took between 10 and 50 minutes, depending on the number of samples and the 
hyperparameters, for each configuration. 


4. Results 


We compare PopGenAdapt with the baseline model consisting only of the backbone MLP
(MLP Base), as well as with the state-of-the-art genotype-to-phenotype snpnet (Ref. 11) model, and
PRS-CSx (Ref. 30), which is an extension of PRS-CS (Ref. 9) to improve polygenic prediction in ancestrally
diverse populations. Note that since snpnet and PRS-CSx are supervised models, like in the
case of the baseline model, they cannot exploit the unlabeled samples.

We show the results obtained for each of the four phenotypes on the three tested target 
underrepresented populations in Tables 4-6. 

PopGenAdapt outperforms snpnet, PRS-CSx, and the baseline model on average and in the 
majority of evaluated scenarios. Moreover, we observe that snpnet, PRS-CSx, and the baseline 
model obtain in multiple cases an AUROC below 0.5, indicating a predictive performance 
worse than the one obtained by random guessing. We note that this does not happen in any
of the evaluated cases for PopGenAdapt. Considering that snpnet and the MLP baseline
methods do not perform any type of domain adaptation, it makes sense for this to happen, 
as the models are tested on a domain that differs from the one in which most of the training 
samples are. 
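
For reference, the AUROC can be read as the probability that a randomly chosen case receives a higher score than a randomly chosen control, which is why values below 0.5 indicate a ranking worse than chance. The following minimal sketch computes it by direct pairwise comparison; the function name `auroc` and the toy scores are illustrative only.

```python
def auroc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
good = auroc(labels, [0.1, 0.4, 0.35, 0.8])       # 0.75
inverted = auroc(labels, [0.9, 0.6, 0.65, 0.2])   # 0.25
```

Inverting the scores of a model maps an AUROC of a to 1 - a, so a value below 0.5 means the model's ranking is systematically reversed on the target domain.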

Table 4. AUROC for the Nigerian population.

Method              Hypertension   Myxoedema   Diabetes   Asthma   Average
snpnet (Ref. 11)    0.4647         0.7949      0.4699     0.4646   0.5485
PRS-CSx (Ref. 30)   0.3884         0.1827      0.6046     0.4306   0.4016
MLP Base            0.5488         0.4423      0.5376     0.4886   0.5043
PopGenAdapt         0.5088         0.8109      0.5516     0.5619   0.6083


Table 5. AUROC for the Sri Lankan population.

Method              Hypertension   Myxoedema   Diabetes   Asthma   Average
snpnet (Ref. 11)    0.4500         0.4631      0.4603     0.5871   0.4901
PRS-CSx (Ref. 30)   0.4852         0.4863      0.3991     0.5379   0.4771
MLP Base            0.5898         0.5137      0.5952     0.5939   0.5731
PopGenAdapt         0.5778         0.5902      0.6723     0.6091   0.6123


Table 6. AUROC for the Hawaiian population.

Method              Hypertension   Myxoedema   Diabetes   Asthma   Average
snpnet (Ref. 11)    0.6132         0.6148      0.5235     0.4857   0.5593
PRS-CSx (Ref. 30)   0.5423         0.5881      0.4577     0.5087   0.5242
MLP Base            0.6104         0.5162      0.5041     0.5666   0.5493
PopGenAdapt         0.6135         0.5556      0.5791     0.5811   0.5823


We hypothesize that a possible reason for the poor performance of snpnet on non-European
populations is the use of the lasso in the method, which performs SNP selection and thus
completely excludes some variants. Possibly, as the training data is mostly from the source
domain, many of the excluded SNPs are the ones that remain useful for making the prediction
on the target population. All this reflects the usefulness of, and need for, domain adaptation
techniques when the data of the target domain is limited, as in the case of
underrepresented populations.

Furthermore, PRS-CSx has poor performance in most settings. We believe that the small
number of samples of the target population still had an effect in this case, reflecting the
value of incorporating unlabeled samples when the labeled data is scarce. Another possible
limitation that could result in poor performance is the use of a relatively small number of
SNPs, although this is shared across all four methods.

Finally, we also observe that the supervised methods suffer less in the Hawaiian dataset, 
probably due to the higher number of labeled samples for training that are used in this scenario
and the fact that the Hawaiian target domain is less precise, resulting in less advantage for
domain adaptation. The target domain, in this case, is less precise due to the nearest neighbor 
approach used to establish the labeled samples, as well as due to the fact that Pacific Islanders 
are admixed populations, resulting in more variability across the samples, as can be observed 
in Fig. 4. 


5. Conclusion 


In this work, we presented PopGenAdapt, a model that applies semi-supervised domain adap- 
tation techniques for genotype-to-phenotype prediction. We also proposed two approaches to 
set the target domain samples and evaluated the model to predict several disease outcomes 
in three different underrepresented populations. The results show that by using SSDA on un- 
derrepresented populations, the prediction performance can be improved over state-of-the-art 
supervised methods. Consequently, we show SSDA is a promising technique to help overcome 
health disparities in precision medicine by exploiting the availability of unlabeled data from 
underrepresented populations while still taking advantage of the greater magnitude of labeled 
data available from populations of European ancestry. 

Nonetheless, there are still some limitations and avenues for future work. Due to the lim- 
ited data on the underrepresented population we had available from the UK Biobank,? we did 
not study the influence the ratio of labeled and unlabeled samples could have on the attained 
performance, as using more samples for training would have left too few for validation and 
testing. Moreover, the scalability of the method to a larger number of SNPs also remains to 
be assessed. Further work on the approach could also include the possibility of learning from 
GWAS summary statistics instead of the SNP sequences or to also support continuous phe- 
notypes apart from categorical ones. Possibly, there is also room for improvement on the base 
model used, as more powerful deep learning architectures could be evaluated. Furthermore, 
considering the integration of PopGenAdapt with emerging paradigms such as federated learning
or differential privacy?! could further enhance the applicability of the method in biomedical 
research and healthcare.


LA-GEM: imputation of gene expression with incorporation of Local Ancestry

Gene imputation and TWAS have become a staple in the genomic medicine discovery space,
helping to identify genes whose regulation effects may contribute to disease susceptibility. However,
the cohorts on which these methods are built are overwhelmingly of European Ancestry. This means
that the unique regulatory variation that exists in non-European populations, specifically African
Ancestry populations, may not be included in the current models. Moreover, African Americans are
an admixed population, with a mix of European and African segments within their genome. No gene
imputation model thus far has incorporated the effect of local ancestry (LA) on gene expression
imputation. As such, we created LA-GEM, which was trained and tested on a cohort of 60 African
American hepatocyte primary cultures. Uniquely, LA-GEM includes local ancestry inference in its
prediction of gene expression. We compared the performance of LA-GEM to PrediXcan trained on the
same dataset (with no inclusion of local ancestry). We were able to reliably predict the expression of
2559 genes (1326 in LA-GEM and 1236 in PrediXcan). Of these, 546 genes were unique to LA-GEM,
including the CYP3A5 gene, which is critical to drug metabolism. We conducted TWAS
analysis on two African American clinical cohorts with pharmacogenomics phenotypic information
to identify novel gene associations. In our IWPC warfarin cohort, we identified 17 transcriptome-wide
significant hits. No gene reached our prespecified significance level in the clopidogrel cohort.
We did see a suggestive association of RAS3A with P2RY12 Reactivity Units (PRU), a clinical measure
of response to anti-platelet therapy. This method demonstrates the need for the incorporation of LA
into studies of admixed populations.


Keywords: Local Ancestry, Gene Expression Model, LA-GEM, PrediXcan, Gene Imputation, 
Population-specific Genetic Variations, Admixed Populations, Ancestry-specific Gene 
Associations 


1. Introduction 


It is widely acknowledged that large-scale genetic studies investigating human diseases have often 
failed to encompass the extensive diversity seen in global populations, as they primarily focus on 
individuals of European descent.1 This insufficiency of ethnic diversity in such studies limits our
understanding of the genetic underpinnings of human diseases and intensifies health disparities. 
Moreover, the paucity of ethnic diversity in human genomics research could lead to a potentially 
hazardous deficiency, or even errors, in our capacity to apply genetic research findings to clinical 
procedures or public health policies. 


† Contributed equally to the work.


© 2023 The Authors. Open Access chapter published by World Scientific Publishing Company and 
distributed under the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC) 4.0 License.

PrediXcan is one of the first and most popular methods used to predict gene expression levels in
different tissues or cell types for use in transcriptome-wide association studies (TWAS).” The
method leverages large publicly available multi-omic datasets that include paired single nucleotide
polymorphism (SNP) data and gene expression data from multiple individuals and tissues.7> By
training a predictive model on these reference datasets, PrediXcan can predict the expression levels
of a given gene in a new individual, based on that person's genetic variation. Models can also be
trained on outside data through various available methods.*+> Various extensions to PrediXcan have
been developed that extend this method to multi-tissue TWAS and causal gene prioritization.>?


In any association study, undetected population stratification can lead to false positives. Therefore,
it is critical to implement appropriate corrections to adjust for these effects.!? One such measure, used in
genome-wide association studies (GWAS), is the inclusion of principal components (PCs), with the
first few PCs estimating global ancestry (GA) in the cohort. GA is largely directed by the demographic
history of the population. However, for admixed populations, nearby SNPs or epigenetic
changes have been shown to have a significant effect on gene expression!'. Thus, local ancestry may
be an important consideration in gene expression prediction. Here we have incorporated LA as a
predictor in the PrediXcan framework to assess whether including this variable improves the
predictability of the gene models in the African American population.


2. Methods 


In this paper, we propose a modification to the PrediXcan method, called LA-GEM (Local Ancestry
based Gene Expression prediction Model), which incorporates local ancestry (LA) predictors along with
cis region genetic variants in the development of gene expression prediction models. We have used
our African American multi-omic hepatocyte dataset (N = 60) to create gene expression prediction
models; however, this method can be used on any multi-omic data from an admixed cohort in which
local ancestry inference is available.


2.1. Primary Hepatocyte Cohort 


Sixty-three African Ancestry (AA) primary human hepatocyte (PHH) cultures were acquired. AA
PHHs were either purchased from commercial companies (BioIVT, TRL/Lonza, Life Technologies,
Corning, and Xenotech), or isolated in-house from cadaveric livers. Livers with active cancer or a
history of hepatocarcinoma were excluded from the study. To account for differences in PHH
sourcing, transcriptomic data went through additional QC measures (i.e., PC visualization, batch
correction) to ensure any differences from source and isolation method were corrected. PHHs were
isolated from cadaveric livers using a modified two-step collagenase perfusion procedure previously
described in Park et al.!? Only hepatocyte cultures with RNA Integrity Number (RIN) over 8 and
with sufficient RNA to conduct NGS were used in the study.


2.2. Genotyping, quality control and imputation


DNA was obtained from around 1 million cells of each PHH culture using the Gentra Puregene
(Qiagen) kit following the manufacturer’s protocol. All extracted DNA samples were barcoded for
genotyping. The Illumina Infinium Multi-Ethnic Global Kit was used for SNP genotyping, and the standard
genotyping protocol was followed. SNPs were filtered out before imputation based on the following
criteria: (1) SNPs present on the sex and mitochondrial chromosomes, as they could alter the minor
allele frequency (MAF) values; (2) SNPs with A/T or C/G alleles, as they may
introduce strand-flip issues; and (3) SNPs with low genotype quality (call rate < 0.95).
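
The three pre-imputation criteria can be expressed as a simple predicate, sketched below. The chromosome codes, argument names, and the function `keep_snp` are hypothetical and chosen only for illustration; in practice such filters are typically applied with a genotype toolkit rather than hand-written code.

```python
# Chromosome codes are an assumption; real datasets may use numeric codes instead.
EXCLUDED_CHROMS = {"X", "Y", "XY", "MT"}
# Strand-ambiguous (palindromic) allele pairs: A/T and C/G.
AMBIGUOUS_PAIRS = {frozenset("AT"), frozenset("CG")}

def keep_snp(chrom, ref, alt, call_rate, min_call_rate=0.95):
    """Return True if the SNP passes all three pre-imputation filters."""
    if chrom in EXCLUDED_CHROMS:                     # (1) sex and mitochondrial chromosomes
        return False
    if frozenset((ref, alt)) in AMBIGUOUS_PAIRS:     # (2) A/T or C/G SNPs
        return False
    return call_rate >= min_call_rate                # (3) genotype quality

snps = [("1", "A", "G", 0.99), ("X", "A", "G", 0.99),
        ("2", "T", "A", 0.99), ("3", "A", "C", 0.90)]
kept = [s for s in snps if keep_snp(*s)]
```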


Using PLINK?, individuals with discordant sex information were identified using the sex check 
function and duplicates or related individuals were identified using the identity-by-descent (IBD) 
method. An IBD cutoff score of 0.125 was used, indicating third-degree relatedness or closer. No 
samples were removed after these QC steps. SNPs with MAF<0.05 were removed. Patient ancestries 
were confirmed using a principal component analysis (PCA) plot of linkage disequilibrium (LD) 
pruned genotype data. LD pruning was conducted to identify the principal dimensions of genetic 
variation between samples. Samples that did not cluster along the spectrum for AA within this PCA 
plot of raw genotype data were removed.'! One individual was excluded after sample and 
genotyping QC analysis, leaving 62 individuals. 


Genotypes were imputed by the TOPMed imputation server (version 1.6.6)12-14 using the TOPMed 
r2 reference panel, GRCh38/hg38 array build, and 0.3 estimated r2 (rsq) filter threshold. Post- 
imputation QC includes removal of SNPs with poor imputation quality scores (<0.8), failed Hardy- 
Weinberg equilibrium tests (p < 0.00001), and low MAFs (<0.05). This resulted in a total of 
5,189,820 SNPs included for model building. 


2.3. Local ancestry inference 


LA was inferred using RFMix (v.1.5.4). RFMix takes as input a set of reference panels (populations 
with known ancestry) and a set of test individuals, and uses a hidden Markov model to infer the 
most likely ancestry of each segment of the test individuals' genomes. The output of RFMix is a set 
of probabilities for each test individual, indicating the likelihood that a specific haplotype segment 
comes from one of the reference populations.'? In this analysis we use Yoruba (African Ancestry)
and American white (CEU — European Ancestry) as our reference populations.


2.4. RNA-sequencing and Quality Control 


Total RNA was extracted from each PHH culture three days after plating using the Qiagen RNeasy 
Plus mini kit. Samples with an RNA integrity number (RIN) less than 8 were removed from analysis. 
This resulted in the removal of 2 samples leaving 60 individuals at the end. Libraries were prepared 
for sequencing using the TruSeq RNA Sample Prep Kit, Set A (Illumina) per manufacturer’s 
protocol. The cDNA libraries were prepared and sequenced using either HiSeq2500 (Illumina) or 
HiSeq4000 (Illumina) instruments by the University of Chicago’s Functional Genomics core, 
producing single-end 50bp reads with approximately 50 million reads per sample. As two 
instruments were used in this study, we were cognizant of potential batch effect and incorporated 
methods for correction as previously described.'* 


2.5. Gene Expression Quantification 


Gene expression was quantified using a collapsed gene model following the GTEx isoform 
collapsing procedure!>. To evaluate gene-level expression, reads were mapped to genes referenced 
with GENCODE(v.25) using RNA-SeQC. HTSeq supplied raw counts for gene expression analysis 
using Bioconductor package DESeq2(v1.20.0). Counts were normalized by regularized log 
transformation, batch correction was performed using ComBat-Seq!*, and PCA was performed 
using DESeq2. 


Gene expression was normalized by trimmed means of M-values normalization method (TMM) 
implemented in edgeR.'° Transcripts per million (TPM) was calculated by first normalizing counts 
by gene length and then by read depth.'’ Gene expression values were filtered based on expression 
thresholds < 0.1 TPM in at least 20% of samples and < 6 reads in at least 20% of samples. The 
expression values for each gene were normalized across samples with inverse normal 
transformation. To account for unmeasured confounding variables in transcriptome data, we used 
probabilistic estimation of expression residuals (PEER).'® 
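
As an illustration of two of the normalization steps above, the sketch below computes TPM (length normalization followed by depth normalization) and a rank-based inverse normal transformation. It is a minimal sketch under stated assumptions: gene lengths are given in kilobases, and the rank offset of 0.5 and the tie handling are our choices, since the text does not specify them.

```python
from statistics import NormalDist


def tpm(counts, lengths_kb):
    """Transcripts per million for one sample.

    Counts are first divided by gene length, then rescaled so that
    the values for the sample sum to one million.
    """
    rates = [c / length for c, length in zip(counts, lengths_kb)]
    scale = sum(rates) / 1e6
    return [r / scale for r in rates]


def inverse_normal_transform(values, offset=0.5):
    """Map the values of one gene across samples to normal quantiles.

    Assumption: quantile = (rank - offset) / n; ties are broken by input order.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    transformed = [0.0] * n
    for rank, index in enumerate(order, start=1):
        transformed[index] = NormalDist().inv_cdf((rank - offset) / n)
    return transformed


sample_tpm = tpm([10, 20, 30], [1.0, 2.0, 3.0])
gene_int = inverse_normal_transform([3.0, 1.0, 2.0])
```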


2.6. LA-GEM Framework 


LA-GEM consists of three main steps. For gene expression prediction, a linear model was trained
on a reference panel that includes genotype, LA predictors, interaction predictors (the interaction
between genotype and LA predictors), and the
corresponding expression data”!’ using the following training model equation!”: 


$$y_g \sim \sum_{a,b,c} w_a S_a + w_b A_b + w_c I_c + \epsilon \qquad (1)$$


where $w_a$, $w_b$, and $w_c$ are the regression parameters to be trained, $S = (S_1, S_2, \ldots, S_a)$ is
the genotype data in the cis region of interest, $A = (A_1, A_2, \ldots, A_b)$ is the local ancestry predictors
for all SNP positions in the cis region, and $I = (I_1, I_2, \ldots, I_c)$ is the interaction predictor ($I = S \times A$). 


Genetically regulated gene expression is then determined using the above model for a new dataset
that includes a combination of genotype and local ancestry information, using the following equation: 


$$\hat{y}_g \sim \sum_{d,e,f} w_d S_d + w_e A_e + w_f I_f \qquad (2)$$


The estimated genetically regulated gene expression $\hat{y}_g$ is then associated with the phenotype
$Z$ using the following equation: 


$$Z \sim \hat{y}_g + \epsilon \qquad (3)$$


LA-GEM prediction models were trained on 60 African American PHH samples followed by 5-fold 
cross-validation. Gene models with an average correlation ρ > 0.1 and P < 0.05 between predicted 
and observed expression were deemed well predicted. 
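
To make the structure of the prediction step concrete, the sketch below assembles the three predictor blocks (genotype, local ancestry, and their product) for one individual and applies a vector of weights. It is a minimal sketch and not the LA-GEM implementation: the function names are hypothetical, the numeric coding of local ancestry is an assumption, and in practice the weights would come from the trained model rather than being set by hand.

```python
def build_predictors(dosages, local_ancestry):
    """Concatenate S, A, and I = S x A for the SNPs in one cis region.

    dosages:        genotype dosage per SNP (S)
    local_ancestry: local ancestry value per SNP (A); the coding is an assumption
    """
    if len(dosages) != len(local_ancestry):
        raise ValueError("S and A must cover the same SNP positions")
    interaction = [s * a for s, a in zip(dosages, local_ancestry)]
    return list(dosages) + list(local_ancestry) + interaction


def predict_expression(predictors, weights):
    """Weighted sum of predictors, as in the prediction equation."""
    if len(predictors) != len(weights):
        raise ValueError("one weight is required per predictor")
    return sum(w * x for w, x in zip(weights, predictors))


x = build_predictors([0, 1, 2], [2, 1, 0])
y_hat = predict_expression(x, [0.5] * len(x))  # illustrative weights only
```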


[Figure 1 panels: (1) Genetic variation (S) + Local Ancestry (A) + Interaction (I = S x A) and the
observed transcriptome (m genes) in the reference samples; (2) Model training by LA-GEM using
ElasticNet modeling, giving an additive model of gene expression trait trained in reference
transcriptome data sets; (3) Weights stored in database of prediction models; (4) Using trained
model on new dataset to obtain the predicted transcriptome (m genes).]

Fig. 1. Flowchart showing LA-GEM workflow. 


2.7. TWAS association using LA-GEM gene imputation. 


As a proof of concept, we used LA-GEM to impute hepatic gene expression in two clinical cohorts to 
identify novel gene associations with drug response. As the expression of hepatic genes is especially 
important in platelet function and drug metabolism, we imputed gene expression of 1323 genes 
which were then used in the TWAS conducted using PrediXcan.? We prespecified a TWAS p-value 
of 3.8 × 10^-5 as significant (0.05/1323). 
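
The prespecified threshold is a Bonferroni correction of a 0.05 significance level over the 1323 imputed genes. The short sketch below reproduces the arithmetic; the function name is ours.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


threshold = bonferroni_threshold(0.05, 1323)
print(f"{threshold:.2e}")  # 3.78e-05, reported as 3.8 x 10^-5
```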


2.7.1. African American warfarin Cohort 


Through the International Warfarin Pharmacogenomics Consortium (IWPC) we collected information 
from 340 African American patients on warfarin as well as 199 African Americans who were part 
of the University of Alabama Birmingham Warfarin cohort, accessed through dbGaP 
(phs000708.v1.p1). Briefly, clinical and demographic data on stable warfarin dose were collected, 
defined as the dose of warfarin needed to elicit an INR within the therapeutic range (2-3) for three 
consecutive clinical visits as previously described.”° 


2.7.2. ACCOuNT clopidogrel cohort 


Through the ACCOuNT Consortium?! we recruited 180 African Americans on the anti-platelet 
drug, clopidogrel. All subjects included in the TWAS had a biomarker measure of clopidogrel 
response, P2Y12 Reactivity Units (PRU). All subjects were on 75 mg of clopidogrel for at least 15 
days with inclusion and exclusion criteria as described previously.”! 


2.8. Log Ratio of Interaction Predictors 


To quantify the relative influence of interaction predictors in our LA-GEM model, we calculated a 
Log Ratio for each gene using the formula: 


Log Ratio = log2(Count of Interaction Predictors + 1) - log2(Count of SNP Dosage Predictors + 1) 


A positive Log Ratio indicates that a gene relies more heavily on interaction predictors, while a 
negative value suggests greater reliance on genetic dosage predictors. 
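
The Log Ratio can be computed per gene directly from the two predictor counts. The sketch below is a minimal illustration; the function name and the example counts are hypothetical.

```python
import math


def log_ratio(n_interaction, n_dosage):
    """log2(interaction count + 1) - log2(SNP dosage count + 1) for one gene."""
    return math.log2(n_interaction + 1) - math.log2(n_dosage + 1)


# Hypothetical genes: (interaction predictors, SNP dosage predictors)
more_interaction = log_ratio(7, 3)  # positive: relies more on interaction predictors
more_dosage = log_ratio(1, 3)       # negative: relies more on dosage predictors
balanced = log_ratio(0, 0)          # zero: the +1 keeps the logarithm defined
```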


2.9. Code Availability 


The LA-GEM model was implemented in R and employs SNP-based local ancestry calculated using 
RFMix version 1.5.4. The source code is publicly available and can be accessed at 
https://github.com/pereralab/LA-GEM. 


3. Results 


We built two gene expression prediction models, LA-GEM and PrediXcan (using AA PHH as 
training). We assessed predictive performance using five-fold cross-validation (R2 of model 
performance). We found that LA-GEM was able to impute 1323 genes at a rho>0.1, p-value < 0.05 
(Average rho = 0.397) as compared to 1236 genes imputed well using the PrediXcan model 
(Average rho = 0.403) in the same dataset without LA (Fig. 2). The average number of predictors 
for LA-GEM is shown in Table 1. 


Table 1. Summary of the total number of predictable genes and the number of predictors of
each type used to train the model.


                                                        LA-GEM
Number of predictable genes                             1323
Number of predictors                                    71702
Number of SNP dosage predictors                         46028
Number of interaction predictors (LA × SNP dosage)      25674


[Venn diagram: LA-GEM, N = 1323; PrediXcan, N = 1236.]

Fig. 2. Venn diagram showing the number of predictable genes in each model. 


3.1. Gene list enrichment analysis of predictable genes 


KEGG Pathway enrichment analysis (Statistical overrepresentation test) was performed using 
g:Profiler*® for predictable genes (1323 genes) obtained from LA-GEM. The analysis yielded 
significant enrichments for several pathways as shown in Fig. 3, notably those linked to 
pharmacogenomics. Among these, three pathways were found to be prominently enriched: 
"Metabolism of xenobiotics by cytochrome P450" (KEGG:00980) with a fold enrichment of 3.37 
and an adjusted p-value of 0.00285, "Drug metabolism - cytochrome P450" (KEGG:00982) with a 
fold enrichment of 3.18 and an adjusted p-value of 0.01097, and "Drug metabolism - other enzymes" 
(KEGG:00983) with a fold enrichment of 2.74 and an adjusted p-value of 0.04196. 


[Dot plot. Y-axis (KEGG pathways): Metabolic pathways; Chemical carcinogenesis - DNA adducts;
Metabolism of xenobiotics by cytochrome P450; Herpes simplex virus 1 infection; Drug metabolism -
cytochrome P450; Drug metabolism - other enzymes; Oxidative phosphorylation; Phagosome.
X-axis: -log10(p-value). Color scale: fold enrichment.]


Fig. 3. Gene set enrichment of the 1323 predictable genes obtained from LA-GEM. The y-axis shows the
categories and the x-axis shows their corresponding -log10(p-value). Color shows the fold enrichment
value for each of the processes. 


3.2. Genes unique to LA-GEM 


Among the 1323 predictable genes, 546 were unique to the LA-GEM model and were not reported as
significant by the PrediXcan model (Fig. 2). Of these 546 unique genes, 2 genes 
(MME and LRRC37A2) were found to be strongly associated with global West African ancestry as 
previously reported!”. In addition, CYP3A5, CYP1A1, CYP4F2, CBR1, and UGT2A1 were also among 
the genes unique to LA-GEM; these genes are known to show significant variability in expression level 
between populations of different ancestry and are important to drug response”. 


3.3. Genes unique to PrediXcan 


Among the 1236 genes predictable by PrediXcan, 459 genes were found only in the PrediXcan model (Fig. 2). 
Out of the 459 genes, 6 genes (DHODH, SNAIL, RBBP9, ENSG00000271239, NPR2, and 
SLC39A11) were found to be strongly associated with global West African ancestry as previously 
reported. !” 


3.4. Genes common to LA-GEM and PrediXcan 


Among the 1323 predictable genes, 777 genes were found to be well imputed by both models. Out 
of these 777 genes, 4 genes (CDK18, GREM2, COL26A1 and MMP20-AS1) were found to be 
strongly associated with West African ancestry as previously reported.'* The rho average for CDK18 
and GREM2 was higher in LA-GEM (0.48 versus 0.33 and 0.28 versus 0.26, respectively), but the 
inverse was true for COL26A1 and MMP20-AS1 (0.39 versus 0.44 and 0.32 versus 0.54, respectively). 
The rho averages for the commonly imputed genes were evenly distributed around the diagonal (Fig. 4),
suggesting that neither model outperformed the other for these genes. For genes that were 
unique to the PrediXcan model, the average difference in rho between models was 0.42. For those 
genes that were unique to LA-GEM, the average difference in rho was 0.46. However, the difference
in prediction accuracy between these two groups of genes was not significant 
(p = 0.07). 


[Scatter plot: LA-GEM rho average (y-axis) versus PrediXcan rho average (x-axis). Legend
(significant genes): common in LA-GEM and PrediXcan; genes of interest; unique to LA-GEM;
unique to PrediXcan.]

Fig. 4. Correlation plot between rho averages of genes well predicted with the LA-GEM and PrediXcan 
models. The top 10 genes showing the maximum rho-average difference between methods are labelled in
dark blue. The red line shows perfect correlation. Well predicted genes unique to the LA-GEM model
are shown in light blue. Well predicted genes unique to the PrediXcan model are shown in grey. Well
predicted genes common to the LA-GEM and PrediXcan models are shown in light green. Genes of
interest with pharmacogenomic relevance or which are associated with West African ancestry are
shown in violet and are labelled in red. 


3.5. Differential Role of Interaction Predictors in LA-GEM and PrediXcan Models 


In the process of model training for LA-GEM, we observed differences in the role played by the 
type of predictors, especially interaction predictors, in model efficacy. Among the 546 genes 
uniquely imputed by the LA-GEM model, 137 genes (or approximately 25% of these significant 
genes) exhibited a positive Log Ratio of the Count of Interaction predictors. This observation 
underscores the relevance of interaction predictors as significant contributors in the unique 
imputation capability of the LA-GEM model. 


In contrast, among the 777 genes that were common between LA-GEM and PrediXcan, only 119 
genes (approximately 15% of these significant genes) had a positive Log Ratio of the Count of 
Interaction predictors. This relatively lower proportion suggests that the common genes might rely 
less on interaction predictors in the LA-GEM model than the genes unique to it. 


The detailed distribution of the Log Ratio of the Count of Interaction predictors for these gene sets 
is depicted in Fig. 5. This difference in the involvement of interaction predictors between genes 
unique to LA-GEM and those common with PrediXcan provides further insight into the 
distinguishing features of these models. 


[Two panels: "Significant genes common to LA-GEM and PrediXcan" and "Significant genes unique to
LA-GEM". Y-axis: Log Ratio of Interaction predictors. X-axis: Gene Rank.]

Fig. 5. Distribution of Positive Log Ratios of Count of Interaction Predictors in Genes Unique to
LA-GEM and Common to LA-GEM and PrediXcan 


3.6. TWAS association to warfarin dose 


Using the IWPC warfarin cohort we imputed hepatocyte gene expression (restricted to those genes 
that were well imputed, N = 1323) and conducted a TWAS. The top associations are shown in the 
Manhattan plot (Fig. 6). Bonferroni-corrected significant associations were found with 17 genes. No 
association was seen with the known warfarin genes VKORC1 or CYP2C9, as these genes were not well 
imputed in our models. 


Fig. 6. Manhattan plot of TWAS results. The figure shows the association of imputed gene expression with 
stable warfarin dose in the IWPC cohort. The x-axis shows the relative genomic position of each gene 
tested (N = 1323) and the y-axis shows the -log10(p-value). The red dashed line marks the threshold of 
significance for this study. 


3.7. TWAS association to PRU in patients taking clopidogrel. 


Using the ACCOuNT cohort, we imputed the hepatic gene expression in 180 African American 
patients on clopidogrel. We found no transcriptome-wide significant associations. However, one top 
association showed RASA3 gene expression associated with increased PRU (p = 0.0014, Beta = 
0.61). This gene has known association to platelet aggregation.”? 


4. Discussion 


This study introduces a novel computational model, LA-GEM, designed to enhance gene expression 
prediction by integrating local ancestry (LA) predictors with cis-regional genetic variants. The 
development and deployment of such a model emerge from the understanding that complex trait 
prediction may be augmented by considering population-specific genetic variations. In many 
traditional models, such as PrediXcan, the unique genetic contributions of LA are not considered, 
potentially leading to overlooked associations.” 


Our findings revealed that LA-GEM improved gene expression prediction compared to PrediXcan 
in some genes, suggesting that the inclusion of LA predictors can effectively supplement traditional 
cis-regional genetic variants. This improvement was demonstrated by the imputation of 1323 genes 
at a rho>0.1, p-value < 0.05 by LA-GEM, compared to 1236 genes imputed by PrediXcan without 
considering LA. 


Beyond these numbers, our study unveiled a set of 546 genes uniquely predicted by LA-GEM and 
777 genes common to both the LA-GEM and PrediXcan AA models. Of the 1323 genes, 6 (MME, 
LRRC37A2, CDK18, GREM2, COL26A1 and ENSG00000281655) were previously found to be 
associated with global West African ancestry and exhibited significant differential expression when 
compared to individuals of European descent.!* These genes are not only statistically significant but 
also relevant to pharmacogenomics. For instance, GREM2, a gene involved in developmental 
processes”* is also associated with allopurinol efficacy”. and MME, implicated in neuropeptide 
degradation?” and associated with ACE inhibitor-induced cough”, were amongst the uniquely 
predicted genes. Lastly, variants in COL26A1 have been associated with aspirin-intolerant asthma.?’ 


Importantly, this study highlights the valuable implications of integrating LA predictors in gene 
expression models for drug response studies. By significantly predicting genes such as CYP3A5, 
CYP1A1, CYP4F2, CBR1, and UGT2A1 - well-known contributors to drug metabolism and disease 
progression*”*3 - our model may aid in TWAS studies of inter-individual variations in drug 
responses and adverse drug reactions in African Americans. A particular emphasis should be placed 
on CYP3A5. This gene has been widely recognized for its variability between different ethnic groups. 
The splice variant CYP3A5*3, associated with reduced enzyme activity, is less frequent in African 
populations, so a functional enzyme is more common in these populations. As most Europeans carry 
CYP3A5*3, the effect of this enzyme on drug response is not well accounted for in studies of 
European individuals. CYP3A5 is thought to contribute to drug efficacy and toxicity, including 
responses to immunosuppressants such as tacrolimus.*4>° 


We applied LA-GEM to the African American warfarin and clopidogrel cohorts, demonstrating its 
utility in clinical studies. The warfarin cohort revealed 17 genes with significant associations with 
warfarin dose requirement, providing novel potential genetic influencers of warfarin dosage 
response beyond the well-known VKORC1 and CYP2C9 genes**?’. The most significant TWAS hit 
was GAS2L1 (associated with increased warfarin dose requirement, p = 7.7x10°!°), which has 
previously been associated with thrombocytopenia in women.** Also, the gene SELENOO on 
chromosome 22 showed association with decreased warfarin dose requirement (p = 5.5x10°). A 
previous study in Sub-Saharan Africans found variants near this gene associated with increased R-6 
Hydroxy-warfarin metabolite measurement.*? 


In the ACCOuNT clopidogrel cohort, we discovered an association between RASA3 gene expression 
and increased P2Y12 Reactivity Units (PRU) level. While the most notable role of RASA3 involves 
platelet function and hemostasis”’, this gene's function is not limited to platelets and the 
bloodstream. It is broadly expressed in many tissues, including the brain, lungs, and kidneys, 
suggesting it might have additional roles outside of platelets. In cancer biology, the Ras and RAP 
GTPases regulated by RASA3 are often involved in tumorigenesis. For instance, inactivation of 
GAPs (like RASA3) can lead to overactive Ras signaling, which can contribute to the development 
of cancer.*? This gene has also been associated to pulmonary hypertension in Sickle Cell Disease.*! 


In terms of computational efficiency, LA-GEM and PrediXcan showed similar performance during 
the model training phase. Specifically, for our limited dataset of 60 hepatocyte samples, both models 
completed the training within a time frame of approximately 2 to 3 hours. It's worth noting that the 
computational time is expected to scale linearly with the size of the sample pool, thus offering 
scalability as more comprehensive datasets become available. 


Several innovative methods have set the stage in ancestry-informed gene expression prediction. 
Notable among these are METRO“, which enhances transcriptome-wide association studies 
(TWAS) through a likelihood-based inference framework, and MATS*®, which jointly analyzes 
samples from multiple populations to account for ancestral heterogeneity in gene expression effects. 
Additionally, a study by Lauren et al.*4 addressed the genetic architecture of gene expression across 
diverse populations, emphasizing the necessity for diverse population sampling in genomics. 
Despite their valuable contributions, none of these methods utilize SNP-based local ancestry as an 
intrinsic part of their predictive models. Our approach, LA-GEM, distinctively integrates SNP-based 
local ancestry predictors along with cis-regional variants to make more nuanced gene expression 
predictions. This unique aspect of LA-GEM not only adds a new layer of granularity to the existing 
methodologies but also paves the way for future explorations in this growing field. 


While our findings are promising, there are several limitations to our study. First, we constructed 
the LA-GEM models with a limited cohort of 60 hepatocyte cultures. This is reflective of the overall 
lack of comprehensive multi-omics data in the African American population. With greater amounts 
of data on which to build these models, we will be better able to predict tissue-specific patterns in 
under-represented populations. This is also evident from the much greater number of well-imputed 
genes available for the GTEx liver model (N = 3356), which is built on 153 liver samples. It should 
be noted that only 12 of these samples have any African ancestry. Second, it is clear that there are 
still genes that are better predicted without the addition of LA. This suggests that comprehensive 
use of TWAS in African American populations may require both LA-aware and traditional gene 
imputation methods. Lastly, the validation of LA-GEM in other tissues and larger cohorts remains 
a crucial next step. Ultimately, the incorporation of LA predictors can contribute significantly to 
personalized medicine, paving the way for treatments and interventions more attuned to the unique 
admixed genetic background of African Americans. 


In conclusion, our study underscores the need for inclusion of LA in genomic methods. LA-GEM 
serves as a valuable tool in this endeavor, providing novel insights into the genomic architecture of 
complex traits in multiethnic populations, and highlighting the importance of considering local 
ancestry when predicting gene expression. The potential to uncover novel ancestry-specific gene 
associations can revolutionize our understanding of the interplay between genetics, disease, and 
therapeutic responses. 


Surgery Patients 


This work demonstrates the use of cluster analysis in detecting fair and unbiased novel 
discoveries. Given a sample population of elective spinal fusion patients, we identify two 
overarching subgroups driven by insurance type. The Medicare group, associated with lower 
socioeconomic status, exhibited an over-representation of negative risk factors. The findings 
provide a compelling depiction of the interwoven socioeconomic and racial disparities present 
within the healthcare system, highlighting their consequential effects on health inequalities. 
The results are intended to guide design of fair and precise machine learning models based 
on intentional integration of population stratification. 


Keywords: clustering; fairness; equity; explainability; feature importance; informatics. 


1. Introduction 


Advances in machine learning (ML) technologies paralleled with increased clinically relevant 
data availability have led to major progress in precision medicine over the past decade.1 
Data-driven solutions, particularly ML methods, are becoming integral to personalized predictive 
medicine as they can inform clinical decision support systems, generate accurate patient 
risk stratification models, and contribute to intelligent guideline development using 
high-dimensional complex medical data.2 Indeed, ML-based approaches have generated robust 
predictive models in the diagnoses of several diseases such as cardiovascular diseases,3 type II 
diabetes,4 and early-stage Alzheimer’s disease5 and for post-surgical outcomes and treatment 
response in several procedures including cardiac surgery6 and spinal surgeries.7,8 Thus, 
clinicians can utilize this information to evaluate risk of poor diagnoses and adverse outcomes, 
assisting clinical decision making by providing personalized assessments of the benefits and 
consequences related to undergoing or delaying invasive procedures. 

The rates of spine surgery, an invasive procedure, have been steadily increasing over the 
past few decades.9 With the proportion of the elderly population projected to dramatically 
increase in the coming years, utilization of spinal procedures is expected to follow as 
degenerative spine conditions become more prevalent.10 Spinal fusions generally require extensive 
muscle dissection and reconstruction of the spinal column, which typically necessitates 
significant post-operative opioid consumption and comes with considerable post-operative risks.11 
With the potential for long recovery periods and the risk of the development of opioid 
dependency as a result of these surgeries, outcome prediction in spinal fusion surgeries has become 
an important area of research. To accurately predict outcomes, it is crucial to consider patient 
diversity, which stems from various sources, including but not limited to biological, societal, 
environmental, and psychosocial factors.12 These sources of diversity can result in significantly 
different outcomes, ultimately affecting a patient’s long-term quality of life after surgery. 

For data-driven predictive models to become widely and safely adopted in clinical settings, 
key research challenges still remain to be resolved. These include assessing clinical 
heterogeneity and avoiding bias in decision-making. Complex ML algorithms have an inherent 
tendency for biased decisions that disproportionately impact underrepresented demographic 
groups, leading to possible discriminatory outcomes.13 This concern is frequently overlooked 
in study design, resulting in unequal treatment of minority individuals.14 We seek to examine 
the intricate heterogeneity in clinical data to identify any differential patient subgroups, if 
present. This will enable us to mitigate bias in the ML decision-making for clinical systems. 

Cluster analysis has been applied in a wide range of applications as an exploratory tool 
to enhance knowledge discovery." It can help by identifying more homogeneous subgroups 
for effective ML models. The goal is to detect and characterize novel sub-types that exhibit 
differing clinical patterns and/or outcome trajectories that may benefit from different 
treatment options. Ultimately, the validity of any sub-grouping paradigm depends on whether the 
resulting sub-groups uncover/expose some biologic or genetic variation, which can be used to 
predict prognoses, recurrent risks, or treatment responses. However, most of the approaches 
employ a single clustering algorithm with limited explainability.!> !” To overcome these 
limitations, we introduce a novel clustering framework to examine and characterize a cohort of 
patients that have undergone elective spinal fusion surgery at Cedars-Sinai Medical Center. 


2. Data 


The dataset consists of electronic health records (EHR) of 5,214 elective spinal fusion (ESF) 
surgery procedures derived from 4,930 patients (ages 18-85) at the authors’ single institution 
from 2013 to 2022. Only patients who survived after surgery, with two or fewer procedures 
are included. If the second procedure was conducted within seven days of the first, the most 
recent is retained. Patients with a second procedure conducted after seven days but less than a 
year apart are excluded. Forty-five features from the patient’s health records were selected and 
integrated in the cluster analysis. These features span baseline characteristics/demographics, 
pre-surgery clinical labs, vitals, medication lists, past medical history, post-operative care, and 
social status, as guided by domain expert (C.T.W.). 

The race feature consolidates both self-reported race and ethnicity information. 
Self-reported ethnicity of “Hispanic”, regardless of race, is represented as “Hispanic”. Race 
designations of “Asian” or “Native Hawaiian or other Pacific Islander” are categorized as 
“Asian/Pacific Islander”. “Native American or Alaska Native”, “Other”, “Patient declined”, 
“Unknown”, or missing are all consolidated as “Other”. Social status features include insurance 
type, marital status, smoking, and alcohol use. Patients with commercial or private insurance 
are grouped as “commercial” while Medicare, California’s Medicaid program (Medi-Cal), or 
all other government insurance are categorized as “medicare”. Vitals features include systolic 
blood pressure (SBP), body mass index (BMI) and pain score. We include the most frequently 
used lab value results from the EHR that had less than 50% missing data (11 labs: hemoglobin, 
white blood cell (WBC) count, red blood cell (RBC) count, platelet count, potassium, sodium, 
chloride, blood urea nitrogen (BUN), creatinine, calcium, and blood type). Selected 
post-operative care features are discharge disposition, length of stay, and readmission status. Past 
medical history (PMH) features (yes/no) are derived by aggregating the ICD codes relevant 
to specific conditions of interest (metabolic, anxiety, chronic pain, mood, headache, nicotine, 
other psychiatric, opioid substance use disorder (SUD), alcohol SUD, cannabis SUD, and other 
SUD). Medication list features are derived based on usage of medications under 7 broad 
categories, as defined by the domain expert. These include muscle relaxers, non-opioid analgesic, 
psychiatric, sleep, medication-assisted treatment, gabapentinoids, and “other”. 
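
The consolidation rules for the race and insurance features can be expressed as two small mapping functions. This is a minimal sketch: the function names and the raw insurance labels ("Commercial", "Private", "Other government") are assumptions about how the EHR encodes these fields, and values not covered by the rules in the text are passed through unchanged.

```python
ASIAN_PI = {"Asian", "Native Hawaiian or other Pacific Islander"}
OTHER_RACES = {"Native American or Alaska Native", "Other", "Patient declined", "Unknown"}
COMMERCIAL = {"Commercial", "Private"}                     # assumed raw labels
GOVERNMENT = {"Medicare", "Medi-Cal", "Other government"}  # assumed raw labels


def consolidate_race(race, ethnicity):
    """Combine self-reported race and ethnicity into a single feature."""
    if ethnicity == "Hispanic":  # takes precedence regardless of race
        return "Hispanic"
    if not race or race in OTHER_RACES:  # missing values fall under "Other"
        return "Other"
    if race in ASIAN_PI:
        return "Asian/Pacific Islander"
    return race


def consolidate_insurance(insurance):
    """Group insurance types into "commercial" and "medicare"."""
    if insurance in COMMERCIAL:
        return "commercial"
    if insurance in GOVERNMENT:
        return "medicare"
    return insurance  # e.g., no insurance; left unchanged in this sketch
```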

The summary of the baseline characteristics is presented in Table 1. For a complete list 
of the medications that map to each medication feature as well as the ICD codes that map 
to each PMH feature, see the supplementary file.* The data request was approved by the Cedars-Sinai 
Honest Enterprise Research Brokers (HERB) committee. This research study was carried out 
under the guidelines and approval of the Cedars-Sinai Institutional Review Board. 


Table 1. Demographic summary of elective spinal fusion surgery patient sample (n = 5,214). 


Characteristic     Distribution

Age                median: 67; range: 18-85; 65+: 57.59%
Gender             Male: 46.47%; Female: 53.53%
Race               White: 75.66%, Hispanic: 10.32%, Black/African-American: 6.75%, Asian: 3.55%, Other: 3.72%
Insurance type     Medicare: 45.09% (65+: 88.74%) (Medicare: 96.85%, Medi-Cal: 0.02%, Other government: 0.01%)
                   Commercial: 53.93% (65+: 31.80%), No Insurance: 0.98%
Marital status     Single: 17.97%, Married: 63.57%, Divorced: 9.51%, Widowed: 6.23%,
                   Significant other: 2.51%, Unknown: 0.21%


3. Methods 


To ensure a fair and unbiased model, we propose a robust automated system that integrates 
multiple clustering algorithms, ensemble internal validation metrics, automated ML 
(autoML)-driven explainability, and post-hoc univariate statistical analysis. 

The data curation steps involve the detection of erroneous or non-biologically plausible 
values and outliers. Domain expert guidance, in conjunction with outlier analyses, is applied 
to mitigate potential bias and possible human data entry errors. Such values are dropped and 
imputed, rather than dropping the entire sample. Missing values are imputed using multivariate 
feature imputation (the scikit-learn IterativeImputer method in Python).'® The 45 features are 
not highly intercorrelated, as all of them pass the correlation filter analysis using the 
Pearson and Spearman rank correlations (< 0.85). 


*Supplementary information is available at: https://github.com/EpistasisLab/PSB2024_spine/ 
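
As a rough illustration of the correlation filter described above, the following sketch flags 
feature pairs whose absolute Pearson or Spearman correlation reaches the 0.85 threshold. It is 
a standard-library approximation written for this text and is not the authors' code: the 
function names and the toy values are hypothetical, and imputation is assumed to have been 
applied beforehand. 

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation, computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def highly_correlated_pairs(features, threshold=0.85):
    """Return the feature-name pairs that fail the filter (|r| >= threshold)."""
    flagged = []
    for a, b in combinations(sorted(features), 2):
        r_pearson = pearson(features[a], features[b])
        r_spearman = spearman(features[a], features[b])
        if max(abs(r_pearson), abs(r_spearman)) >= threshold:
            flagged.append((a, b))
    return flagged


# Hypothetical toy values, not taken from the study cohort.
toy = {
    "hemoglobin": [11.0, 12.5, 13.1, 14.2, 10.4, 12.9],
    "rbc_count": [3.6, 4.1, 4.4, 4.8, 3.4, 4.3],
    "potassium": [4.1, 3.9, 4.4, 4.0, 4.3, 3.8],
}
flagged = highly_correlated_pairs(toy)
```

In the toy data the first two columns are deliberately constructed to move together so that the 
filter has something to flag; in the study itself, all 45 features are reported to pass. 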

We perform automated clustering with hyperparameter sampling across several algorithms, 
permuting the distance type (Euclidean, Manhattan) and the number of clusters (k = 2 to 10) 
where applicable. The procedure uses five individual algorithms (Spectral, Agglomerative, 
k-means, Birch, and Gaussian mixture).'® We also build an ensemble clustering model that 
combines these individual methods using the mixture model consensus metric in 
OpenEnsemble.?°?! Our model also includes the TooManyCells (TMC) spectral hierarchical 
clustering method,”? for a total of seven methods with 68 permutations. To integrate TMC 
into the automated clustering pipeline, we implement an extension that aggregates cluster 
labels with multiple terminal cluster nodes starting at the root node. The depth of the tree 
partition serves as a TMC hyperparameter. The optimal clustering output is determined using 
the ensemble internal validation metric model introduced by Nguyen et al.?? The model 
assigns a final score based on a consensus of five metrics (Calinski-Harabasz, Davies-Bouldin, 
Silhouette score, I, and Xie-Beni).?4 Each metric ranks its top 15 results and sets the 
remainder to zero. The ensemble model assigns a final overall rank score to each clustering 
outcome based on the weighted sum of the individual ranking assignment of each metric. 
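
The rank-based scoring described above can be sketched as follows. This is a minimal 
illustration, not the implementation of Nguyen et al.: the function names, the toy metric 
values, and the use of unit weights are assumptions. Unit weights are at least consistent with 
Table 2A, where the per-metric ranks printed for kmeans-2 (15, 14, 15, 15, 15) add up to the 
reported score of 74. 

```python
def rank_points(values, higher_is_better, top_n=15):
    """Give top_n points to the best result, top_n - 1 to the next, and 0 outside the top_n."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=higher_is_better)
    points = [0] * len(values)
    for position, idx in enumerate(order[:top_n]):
        points[idx] = top_n - position
    return points


def ensemble_scores(metric_table, directions, weights=None, top_n=15):
    """Weighted sum of per-metric rank points for each clustering result."""
    n_results = len(next(iter(metric_table.values())))
    totals = [0.0] * n_results
    for name, values in metric_table.items():
        weight = 1.0 if weights is None else weights[name]  # unit weights are an assumption
        for i, pts in enumerate(rank_points(values, directions[name], top_n)):
            totals[i] += weight * pts
    return totals


# Hypothetical values for three clustering results (not taken from Table 2).
results = ["kmeans-2", "gmm-2", "birch-5"]
metric_table = {
    "calinski_harabasz": [550.0, 540.0, 120.0],
    "davies_bouldin": [3.0, 2.9, 5.0],
    "silhouette": [0.10, 0.09, 0.02],
}
# Calinski-Harabasz and Silhouette are maximized; Davies-Bouldin is minimized.
directions = {"calinski_harabasz": True, "davies_bouldin": False, "silhouette": True}

totals = ensemble_scores(metric_table, directions, top_n=2)
best = results[totals.index(max(totals))]
```

Here `top_n=2` is used only so that the zeroing of lower-ranked results is visible with three 
candidates; the paper uses the top 15. 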

A key novelty of our clustering framework is that we utilize a model-agnostic approach to 
evaluate the feature importance and assess which key discriminant features are driving cluster 
separation with an autoML tool, TPOT.?° TPOT evaluates the informative contributions of 
features to clustering results by predicting cluster labels with each feature independently. In 
contrast to the current state-of-the-art methods for evaluating feature importance (such as 
SHapley Additive exPlanation,?° Permutation feature importance, Gini impurity in Random 
Forest?”), TPOT overcomes the single-model limitation, as it searches and optimizes across 
multiple ML algorithms. For each feature, we run the TPOT optimization (across 13 different 
classifier configurations) and extract the performance of the best-performing model as the 
feature importance metric. This provides insight into the key discriminant input features and 
guides the next steps of analysis. Visualization of results is performed using ISOMAP?! and 
TMC dendrograms. Code for all the methods is available at 
https://github.com/EpistasisLab/PSB2024_spine. 
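
The idea of scoring each feature by how well it alone predicts the cluster labels can be 
sketched as below. This is not TPOT and not the authors' pipeline: the TPOT search over 
classifiers is replaced by a single majority-vote lookup on one categorical feature, the score 
is computed on the training data for brevity, and all names and toy values are hypothetical. 

```python
from collections import Counter, defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recall values."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == label]
        recalls.append(sum(y_pred[i] == label for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def single_feature_score(values, cluster_labels):
    """Predict the cluster label from one categorical feature via a majority-vote lookup."""
    votes = defaultdict(Counter)
    for value, cluster in zip(values, cluster_labels):
        votes[value][cluster] += 1
    lookup = {value: counts.most_common(1)[0][0] for value, counts in votes.items()}
    predictions = [lookup[value] for value in values]
    return balanced_accuracy(cluster_labels, predictions)


# Hypothetical toy cohort: insurance separates the clusters perfectly, gender does not.
clusters = ["C1", "C1", "C1", "C2", "C2", "C2"]
features = {
    "insurance": ["commercial"] * 3 + ["medicare"] * 3,
    "gender": ["M", "F", "M", "F", "M", "F"],
}

importance = {name: single_feature_score(vals, clusters) for name, vals in features.items()}
ranked = sorted(importance, key=importance.get, reverse=True)
```

A feature that reproduces the partition exactly scores 1.0, which mirrors how a 100% B-Acc. is 
read in the results; a faithful version would substitute a cross-validated model search for 
the lookup. 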

Univariate global statistical tests are conducted, as post-hoc analyses, to assess which 
features exhibit differences among the cluster groups. The method of analysis differs depending 
on the measurement scale of the feature. Features with significant test results suggest utility 
in clustering. For continuous features, we test for normality using Shapiro-Wilk tests. All 
features are non-normally distributed. Thus, we employ non-parametric Mann-Whitney tests 
(or Kruskal-Wallis tests in case of multiple groups). For categorical and binomial features, we 
use Chi-square tests of independence. The resulting p-values of these tests are corrected for 
multiple testing using the Benjamini-Hochberg procedure. 
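
For reference, the Benjamini-Hochberg adjustment mentioned above can be written in a few 
lines. The sketch below is a generic textbook version using only the standard library; the 
function name and the example p-values are illustrative and do not come from the study. 

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Illustrative p-values for four hypothetical features.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
significant = [p <= 0.05 for p in adjusted]
```

Each raw p-value is scaled by the number of tests divided by its rank, and the running minimum 
keeps the adjusted values non-decreasing in the raw p-values. 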


4. Results 
4.1. Entire ESF sample is stratified by socioeconomic factor of insurance. 


Upon evaluating ensemble clustering on the overall cohort of 5,214 surgeries, Table 2A shows 
that k-means with two clusters consistently outperforms other methods across internal 
validation metrics. The top-ranking methods (k-means, Gaussian Mixture, spectral, TooManyCells) 
return similar 2-cluster partitions and display high consistency as top performers across all 
five metrics. Subsequent analyses are conducted on the k-means-2 result (C1 and C2). The 
subgroups are visualized using both ISOMAP (Figure 1(a)) and TMC (Figure 1(b)). Note that TMC 
applies its embedded technique (spectral hierarchical clustering) prior to visualization and 
therefore does not represent the C1 and C2 separation exactly. TPOT feature importance 
analysis reveals that insurance type, a potential socioeconomic factor, is most important to 
cluster separation explainability (100% balanced accuracy (B-Acc.)). Age, discharge 
disposition, and PMH metabolic are of less importance (79.1%, 64.2%, and 62.7% B-Acc., 
respectively). Mapping the insurance type label onto the TMC dendrograms confirms this as well 
(Figure 1). Cluster C1 consists of all patients with “commercial insurance” and 40 with “no 
insurance”, while C2 has all patients on “medicare insurance” and 11 with “no insurance”. 


Table 2. Top-ranked results for 1st- and 2nd-order clustering based on ensemble validation rank scores. 

Output [Cluster sizes]                 CH (rank)     Db (rank)   I (rank)     Sil (rank)   Xb (rank)    Score
A. Clustering on entire cohort
kmeans-2 [2852, 2362]                  550.01 (15)   3.03 (14)   0.619 (15)   0.098 (15)   2.348 (15)   74
GaussianMixture-2 [2732, 2482]         549.56 (14)   3.03 (15)   0.619 (14)   0.097 (14)   2.348 (14)   71
Spectral (euclidean)-2 [2863, 2351]    533.62 (13)   3.08 (11)   0.597 (13)   0.095 (13)   2.436 (13)   63
B. 2nd-order clustering on C1 group
kmeans-2 [1872, 980]                   202.13 (15)   3.57 (0)    0.398 (15)   0.089 (14)   3.180 (3)    47
Spectral (manhattan)-2 [1931, 921]     168.17 (13)   3.86 (0)    0.341 (14)   0.081 (13)   3.705 (0)    40
Mixture model-2 [1638, 1214]           168.23 (14)   4.06 (0)    0.305 (13)   0.074 (12)   4.142 (0)    39
C. 2nd-order clustering on C2 group
kmeans-2 [1476, 886]                   204.92 (15)   3.27 (11)   0.503 (15)   0.094 (15)   2.700 (15)   69
GaussianMixture-2 [1474, 888]          204.88 (14)   3.27 (10)   0.503 (14)   0.094 (14)   2.702 (14)   64
Mixture model-2 [1473, 889]            195.48 (13)   3.36 (1)    0.480 (13)   0.094 (13)   2.834 (13)   54


(a) ISOMAP representation (b) C1 & C2 using TMC (c) TMC with insurance 


Fig. 1. Visualization of k-means-2 results on entire cohort. 


Age is a determinant of Medicare eligibility (65+) in the USA. Thus, we conduct univariate 
statistical analyses between C1 and C2 (insurance-driven clusters) as well as between and 
within age-stratified subgroups. Figure 2 illustrates the experimental design of these analyses. 
The pairwise comparisons are conducted as follows: Exp 1: C1 vs. C2; Exp 2: 65+ subgroups 
of C1 and C2, i.e., C1 ≥ 65 vs. C2 ≥ 65; Exp 3: 65- subgroups of C1 and C2, i.e., C1 < 65 vs. 
C2 < 65; Exp 4: within C1, C1 ≥ 65 vs. C1 < 65; Exp 5: within C2, C2 ≥ 65 vs. C2 < 65. 
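
The five comparisons listed above can be expressed as a small lookup of subgroup pairs. The 
sketch below is only an illustration of the design: the record layout, the names `subgroup`, 
`EXPERIMENTS`, and `build_comparisons`, and the toy patients are hypothetical, and the 65+ 
subgroup is assumed to mean age 65 or older. 

```python
def subgroup(patients, cluster, age_group=None):
    """Select patients in a cluster, optionally restricted to the '65+' or '<65' stratum."""
    selected = [p for p in patients if p["cluster"] == cluster]
    if age_group == "65+":
        selected = [p for p in selected if p["age"] >= 65]
    elif age_group == "<65":
        selected = [p for p in selected if p["age"] < 65]
    return selected


# Each experiment is a pair of (cluster, age stratum) definitions.
EXPERIMENTS = {
    "Exp 1": (("C1", None), ("C2", None)),
    "Exp 2": (("C1", "65+"), ("C2", "65+")),
    "Exp 3": (("C1", "<65"), ("C2", "<65")),
    "Exp 4": (("C1", "65+"), ("C1", "<65")),
    "Exp 5": (("C2", "65+"), ("C2", "<65")),
}


def build_comparisons(patients):
    """Return the two patient groups to be compared in each experiment."""
    return {
        name: (subgroup(patients, *left), subgroup(patients, *right))
        for name, (left, right) in EXPERIMENTS.items()
    }


# Hypothetical toy records.
patients = [
    {"id": 1, "cluster": "C1", "age": 70},
    {"id": 2, "cluster": "C1", "age": 50},
    {"id": 3, "cluster": "C2", "age": 80},
    {"id": 4, "cluster": "C2", "age": 66},
    {"id": 5, "cluster": "C2", "age": 40},
]
sizes = {name: (len(a), len(b)) for name, (a, b) in build_comparisons(patients).items()}
```

Each pair of groups would then be passed to the univariate tests described in the Methods. 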


4.1.1. Univariate analysis reveals health disparities associated with insurance types. 


Figure 3 summarizes the key features that differ significantly between C1 and C2 at the entire 
cohort level and when age-stratified (Exp 1, 2, and 3). Nine features display age-independence, 
as they are statistically different across all three comparisons (Figure 3a). These are race, 
marital status, discharge disposition, hemoglobin, platelet count, RBC count, potassium, and 
two PMH features (metabolic and anxiety). We also observe that some features are not different 
between C1 and C2 (Exp 1) but do exhibit significant differences within the 65- comparisons 
(Exp 3) (Figure 3b). These features (PMH features of pain score, other psychiatric disorders, 
nicotine use, headache, other SUDs, and use of non-opioid analgesics) imply some possible 
health disparities between the two socioeconomically driven groups after accounting for the 
age factor. (Note that an additional significant feature, PMH of other SUD, is not shown in 
the figure, as it affects less than 5% of the overall population.) There are no features that 
are significant only between the 65+ subgroups (Exp 2) and not at the entire cohort level 
(Exp 1). See the supplementary file* for complete details of all the pairwise comparisons. 
The analysis also reveals some features that are significant across all three comparisons 
(Exp 1, 2, and 3) and are also significant within C1 and C2 when stratified by age (Exp 4 and 
5). These include race, platelet count, RBC count, marital status, discharge disposition, and 
PMH features of metabolic and anxiety (see the supplementary file*). Features such as 
hemoglobin are significant within C1 (Exp 4) but not C2 (Exp 5). All PMH features are 
significantly different within the C2 age-stratified groups. Overall, negative health factors, 
such as lower hemoglobin, RBC, platelet count, and potassium levels, and higher incidence of 
metabolic disease and anxiety, are associated with C2, indicating socioeconomic health 
disparities. 


Fig. 2. Experimental design of cluster analyses and pairwise comparisons. 


4.1.2. Adverse outcomes are disproportionately observed in minority racial groups. 


From the pairwise comparisons (Exps 1-5), race is consistently significant. The 65- population 
in C2 had a larger proportion of non-White patients (60% White, compared to 73% in C1), with 
the disparity being most prominent in the Black/African-American demographic, which shows a 
wide percentage gap of 16% vs. 5.7% (Figure 3). Given a predominantly White cohort, it is 
important to highlight that complex ML models may inadvertently neglect pattern associations 
within minority classes. We recognize the importance of deeper exploration into race since our 
clustering model could potentially marginalize significant patterns linked to minority groups. 
This section further examines race-related differentiation at both cohort and cluster levels. 
We observe significant differences for post-operative care outcomes (discharge disposition, 
length of hospital stay (LOS), and readmission rate) between race groups in multiple 
comparisons (Figure 4). At the entire cohort level, Black patients exhibit a higher proportion 
of adverse outcomes in all scenarios (see Figure 4). The “Other” group (Native American or 
Alaskan Native, Other, patient declined, and unknown) also demonstrates increased rates of 
adverse outcomes for discharge disposition and LOS. We subsequently examine the cluster and 
age-stratified groups to identify whether the adverse outcome over-representation in Black 
and “Other” patients remains independent of insurance and age. For readmission rate and 
discharge disposition, the higher adverse outcome effect remains significant in C2, 
specifically in the 65+ subgroup. However, LOS is independent of race in C2, as adverse 
outcomes become more prominent for all groups, likely denoting a combined effect of 
socioeconomic disparities and advanced age. Race also appears to be an important factor in C1, 
with Black and “Other” patients having higher LOS (> 7 days) and more discharges to locations 
other than home compared to other groups. These results, although limited due to small 
non-White sample sizes, indicate that race is an important discriminant of health outcomes for 
ESF surgery. 


*Supplementary information is available at: https://github.com/EpistasisLab/PSB2024_spine/ 


[Figure 3 panels, each comparing C1 vs. C2, C1 ≥ 65 vs. C2 ≥ 65, and C1 < 65 vs. C2 < 65. 
(A) Age (years), Race, Platelet Count (thousands/µL), Discharge Disposition, Red Blood Cell 
Count (millions/µL), Metabolic Disease, Potassium (mmol/L), Anxiety. (B) Pain Score, Nicotine 
SUD, Headache, Non-Opioid Analgesic Use.] 

Fig. 3. Pairwise comparison results of selected features for Exp 1, 2, & 3 (C1 vs. C2; C1 ≥ 65 vs. 
C2 ≥ 65; C1 < 65 vs. C2 < 65) significant across all in (A), and only for Exp 3 in (B). 


Fig. 4. Pairwise comparisons of clinical outcomes across n by race. 


4.2. Second-order clustering reveals clinical and demographic heterogeneity 


Given the overwhelmingly distinct clusters driven by socioeconomic factors, we repeat the 
automated clustering on C1 and C2 separately to further examine the insurance-associated 
heterogeneity. This is denoted as second-order clustering (Exp 6 and 7) in Figure 2. 

The top-ranking clustering results for both experiments are shown in Table 2B and C. We 
observe that in both instances, the k-means-2 result is ranked highest. For C2, all the 
high-ranking algorithms unanimously identify 2-cluster solutions with minor differences in 
size distribution. In contrast, for C1, though the 2-cluster solution is ranked best overall, 
there is more variance among the metrics. Visual inspection of the ISOMAP decomposition and 
TMC dendrograms with cluster labels confirms that the C2 clusters (C2:c1 and C2:c2) display 
more separation than the C1 clusters (C1:c1 and C1:c2) (Figures 5 and 6). 


(a) ISOMAP representation (b) TMC with kmeans-2 (c) Marital status TMC 


Fig. 5. Optimal clustering result on C1 subgroup: kmeans-2 optimal result. 


(a) ISOMAP representation (b) TMC with kmeans-2 (c) Discharge TMC 


Fig. 6. Optimal clustering result on C2 subgroup: kmeans-2 optimal result. 

TPOT feature importance analysis identifies marital status as highly discriminant for the 
C1:c1 and C1:c2 groups, and discharge disposition for C2:c1 and C2:c2, both with 100% B-Acc. 
For C1, age trails with 57.4% B-Acc. For C2, LOS, hemoglobin, and readmission predict the 
cluster label with 64.2%, 61.8%, and 60.9% B-Acc., respectively. This is illustrated using the 
TMC dendrograms overlaid with the discriminant features in Figures 5(c) and 6(c). C1:c2 
consists entirely of married patients while C1:c1 contains all others. In C2, the two clusters 
(C2:c1 and C2:c2) are stratified primarily by discharge disposition. C2:c1 (n = 886) consists 
mainly of patients discharged to rehab and skilled-nursing facilities (SNF), while C2:c2 
(n = 1,476) is composed almost entirely of home discharge patients (99.86%). We also observe 
that the second-order clustering yields subgroups of disproportionate sizes (large vs. small) 
compared to the first-order clustering. 


Fig. 7. Selected significant features from univariate analysis of pairwise comparisons on second-order 
clustering on (A) cluster C1 (Exp 6) and (B) cluster C2 (Exp 7). 

From the univariate analysis results (Figure 7), statistically significant differences are 
observed for both comparisons (Exp 6 and 7) for age, race, gender, discharge disposition, 
readmission, LOS, platelet count, RBC count, hemoglobin, BUN, creatinine, chloride, calcium, 
sodium, and PMH features of anxiety and mood, of which selected features are illustrated in 
Figure 7. Overall, we observe that C2 displays a higher level of complexity and divergence. 
The features that drive the C2:c1 vs. C2:c2 divergence are LOS > 7 days (44% vs. 13%), 
readmission rate (29% vs. 11%), and lower median hemoglobin values (11.4 vs. 12.3) (Figure 7B). 


5. Discussion 


In this study, we demonstrate our commitment to constructing equitable and unbiased ML 
models. Our initial intention was to develop a predictive model specific to elective spine 
fusion surgery; however, during the course of our investigation, we identified the necessity 
for a deeper understanding of potential disparities present within our dataset in order to 
address clinical inquiries more accurately. The manifestation of bias within ML algorithms 
through data sources has been substantially highlighted in prior literature.!?:?9°° To combat 
this, we employ a robust automated multiple clustering approach to scrutinize our dataset for 
potential bias factors prior to developing an ML model. Investigation of subpopulation 
structure in clinical cohorts is an important area of research and has significant 
implications for patient care and treatment. However, the methodologies used in most 
studies!® !” are limited in that they usually implement a single clustering technique without 
conducting exploratory investigations of their results, potentially overlooking components 
driving heterogeneity. Our framework addresses these shortcomings by employing automated 
cluster analysis with hyperparameter tuning and a multi-metric performance score. The 
framework, enhanced by autoML-driven feature importance estimation along with univariate 
analysis, allowed us to uncover and explain drivers of population divergence. We demonstrate 
its capabilities in uncovering inherent patterns of heterogeneity in patients undergoing ESF, 
an invasive medical procedure that is associated with risks of many adverse outcomes.!! 

The cluster analysis uncovers two diverse subgroups (C1 and C2), each exhibiting unique 
characteristics, driven mainly by socioeconomic factors (insurance type and race). It is 
important to note that the entire ESF sample is almost evenly split between insurance types 
(54% commercial insurance). This indicates increasing equity of access, as patients with 
Medicare coverage have historically experienced limited access to certain medical procedures, 
including elective spinal fusion.'° However, disheartening but not surprising are the 
significant health disparities observed in the cohort, driven by socioeconomic factors. 
Similarly, several recent studies?! highlight that racial minorities, and those with lower 
socioeconomic status, are at higher risk of adverse outcomes. The C2 subgroup contains all 
Medicare insurance patients and is characterized by an increased proportion of minority groups 
compared to C1, though the overall sample is primarily White (Table 1). C2 patients have 
higher occurrences of non-home discharge dispositions, clinically remarkable past medical 
histories, especially with respect to metabolic-related diseases and anxiety, as well as 
clinical lab values associated with poor prognoses (Figure 3). In particular, the under-65 C2 
patients (n = 266) have significantly higher pain scores and a higher prevalence of nicotine 
substance abuse, headaches, other psychiatric disorders, and the conditions already noted 
(metabolic and anxiety). These characteristics are not surprising; what is notable, however, 
is that the socioeconomic factor of insurance overwhelms the clustering results, compelling us 
to adjust for it prior to characterizing the underlying heterogeneity with second-order 
clustering on C1 and C2 separately. 

C1 and C2 each contain a smaller sub-cluster that is associated with poor health outcomes 
(C1:c1 and C2:c1). Interestingly, C1 is stratified by marital status, with C1:c2 consisting of 
all married patients while C1:c1, its adverse outcome subcluster, is made up of all other 
marital status groups (Figure 5(c)). C2 is stratified by discharge disposition. C2:c1, its 
adverse outcome group, consists of almost all non-home discharged patients (99.86%) (Figure 
6(c)). Despite the unique characteristics that differentiate the adverse outcome subclusters 
(C1:c1 and C2:c1), they share striking similarities, as both are composed of patients 
presenting suboptimal values of numerous labs, PMH of mood disorder, poor outcomes (LOS, 
discharge, and readmission), and higher proportions of minority patients (Figure 7). Though 
similarities exist, the proportions of patients with negative indicators of health (and their 
magnitudes) are greater in C2:c1 compared to C1:c1 (Figure 7). This is also true for race, as 
C2:c1 has a higher proportion of minority patients. This aligns with the validation metrics 
analysis (Table 2), which indicates more separation in C2 compared to C1. In addition, C2:c1 
has significantly suboptimal WBC count, PMH of metabolic and chronic pain, and use of 
gabapentin, while C1:c1 has more prominence of PMH of anxiety, alcohol, other psychiatric 
disorders, nicotine use, and other SUDs (Figure 7). We acknowledge that these characteristics 
are probably due to a combination of social, environmental, and biological factors. However, 
interestingly, overall better prognoses are strongly associated with “married” status 
(Figure 7A). 

The conspicuous racial partitioning observed at both levels of clustering highlights the 
importance of conducting thorough exploratory analysis and incorporating fair algorithms in 
ML. The race-stratified analysis further validates findings on existing socioeconomic 
disparities within the ESF sample, especially for post-surgery event outcomes (Figure 4). All 
relatively poor outcome subgroups (C2 as a whole, the under-65 age-stratified cohort in C2, 
C1:c1, and C2:c1) have significantly more minority patients (Figures 3, 7). Interestingly, the 
over-representation of Black and “Other” patients is similar in both C1:c1 and C2:c1 (Black: 
~10% and “Other”: ~4.5%, which includes Native Americans). This is concerning given the overall 
low percentage of Black (6.75%) and “Other” (3.72%) patients in the entire sample. Note that 
“Other” also includes self-reported race entries of “Other”, “patient declined”, and 
“Unknown”, which are often associated with privacy, self-identity/profiling, and trust 
concerns.*? Constructing “Other” from Native Americans, Alaskan Natives, and individuals with 
no reported race is not optimal and was done due to small sample sizes. Nevertheless, 
identifying higher proportions of these individuals in the adverse risk clusters is likely 
driven by cumulative disparity factors associated with these groups. These implications are 
important, as identifying patients with needs for specialized care could lead to substantial 
improvements in clinical outcomes. 

Complex pattern recognition models can sometimes overlook minority groups due to imbalanced 
data, potentially leading to biased results and unfair outcomes.!? Here, we showcase a 
framework that mitigates these issues by incorporating information about heterogeneous 
subgroups into the clinical risk score model. With thorough evaluation and validation, our 
discovery from clustering results has the potential to be actionable in clinical settings, 
allowing diverse groups of patients and clinicians to receive more precise estimates of 
treatment success and risk of developing adverse effects. This approach can be transferred to 
other domains that require clinical decision support. Moreover, as we observe racial and 
socioeconomic indicators playing key roles in explaining the disproportionate distribution of 
adverse effects, it is important to continue advocating for fairer healthcare policies, 
especially for preventative care access. By identifying socioeconomic status and race as 
significant determinants of health outcomes, our two-tier approach averts a potential scenario 
of introducing health disparities due to algorithmic bias. We are enthusiastic about the 
development and deployment of our methodology in predictive modeling in clinical settings to 
assist surgeons and patients in real-time decision-making regarding the most efficacious ESF 
surgery options. These clusters could be utilized in a sampling scheme to mitigate bias in ML 
models aimed at predicting outcomes, by incorporating feature engineering based on the cluster 
labels into the model as well as exploring risk score ML models with the discovered population 
stratification. This study presents a compelling illustration of the heterogeneity within the 
healthcare system and underscores the need for personalized medicine as a strategic approach 
to enhance healthcare and reduce health disparities. Therefore, we strongly advocate for 
others to employ a similarly rigorous approach to data integration in order to better 
comprehend potential biases. 
Evidence of recent and ongoing admixture in the U.S. and influences on health and disparities 

Many researchers in genetics and social science incorporate information about race in their work. 
However, migrations (historical and forced) and social mobility have brought formerly separated 
populations of humans together, creating younger generations of individuals who have more complex 
and diverse ancestry and race profiles than older age groups. Here, we sought to better understand how 
temporal changes in genetic admixture influence levels of heterozygosity and impact health outcomes. 
We evaluated variation in genetic ancestry over 100 birth years in a cohort of 35,842 individuals with 
electronic health record (EHR) information in the Southeastern United States. Using the software 
STRUCTURE, we analyzed 2,678 ancestrally informative markers relative to three ancestral clusters 
(African, East Asian, and European) and observed rising levels of admixture for all clinically-defined 
race groups since 1990. Most race groups also exhibited increases in heterozygosity and long-range 
linkage disequilibrium over time, further supporting the finding of increasing admixture in young 
individuals in our cohort. These data are consistent with United States Census information from broader 
geographic areas and highlight the changing demography of the population. This increased diversity 
challenges classic approaches to studies of genotype-phenotype relationships which motivated us to 
explore the relationship between heterozygosity and disease diagnosis. Using a phenome-wide 
association study approach, we explored the relationship between admixture and disease risk and found 
that increased admixture resulted in protective associations with female reproductive disorders and 
increased risk for diseases with links to autoimmune dysfunction. These data suggest that tendencies in 
the United States population are increasing ancestral complexity over time. Further, these observations 
imply that, because both prevalence and severity of many diseases vary by race groups, complexity of 
ancestral origins influences health and disparities. 


Keywords: Disparities; Electronic Health Records; Health Outcomes; Admixture. 
distributed under the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC) 

1. Introduction 


Genetic admixture has previously been used to identify geographic variability and historical 
migration patterns across several human populations!!! and to investigate the genetic basis of 
diseases!*"!*, Two studies have shown temporal increases in heterozygosity due to urbanization, one in 
a Croatian population!’ and one in a U.S. population of European ancestry'®. However, these studies of 
admixture have not connected migratory or urbanization patterns to health outcomes. One study that 
performed a meta-analysis on populations of individuals with European American and African 
American ancestry found a positive association between levels of heterozygosity and mortality in 
humans!’. However, this study investigated only a single outcome and omitted the impact of temporal 
trends in admixture and heterozygosity on epidemiological outcomes. Further, these studies have not 
explored variability in admixture with respect to age or generational trends. 

Understanding temporal changes in ancestry and heterozygosity has important implications for 
individual- and population-level health in humans that remain unexplored. Using human population 
genetic data to study the connection between ancestry, heterozygosity, and health is ideal due to the 
substantial number of individuals with genetic data linked to electronic health records (EHR)!*!”. 
Further, many diseases and their etiologies recorded in EHRs are known in detail and are well classified, 
facilitating the estimation of the relationship between heterozygosity and disease risks. 

In our cohort of 35,842 individuals from the Southeastern U.S., we investigated temporal changes 
and variance of admixture by age with de-identified information from the EHR on race, ethnicity, and 
year of birth linked to genotype data from the Illumina HumanExome array in Vanderbilt University 
Medical Center’s biorepository resource (BioVU)!®*. In addition, we used a phenome-wide association 
study (PheWAS?”°) to connect genetic data with the clinical phenome capturing clinical disease 
outcomes in BioVU. This approach allowed us to investigate the relationship between increased 
ancestral complexity and disease risk. Our study provides important insights into the changing 
landscape of genetic admixture in a clinical context. 


2. Methods 


2.1. Study Population 


Individuals were selected from the BioVU DNA repository which links clinical data from de-identified 
electronic medical records to DNA samples obtained from patients at Vanderbilt University Medical 
Center (VUMC)!8. Each individual’s race was designated in the Electronic Health Record (EHR) as 
either White, Black, Asian, Pacific Islander, American Indian/Alaska Native, or declined/unknown, and 
an ethnicity of Hispanic/Latino, Not Hispanic/Latino, or declined/unknown. BioVU also contains third- 
party designated race, which is a good predictor of genetically estimated ancestry in this database”. 
This study of de-identified data was determined to be non-human subject research by the institutional 
review board (IRB) of Vanderbilt University, Nashville, TN. 
2.2. DNA Extraction and Genotyping 


All DNA samples were isolated from whole blood using the Autopure LS system (QIAGEN Inc., 
Valencia, CA). Genomic DNA was quantitated via an ND-8000 spectrophotometer and DNA quality 
was evaluated via gel electrophoresis. Individuals were genotyped using the Illumina Infinium 
HumanExome Array [12v1-1] (Illumina Inc., San Diego, CA). The data were processed for genotype 
calling using Illumina’s Genome Studio (Illumina Inc., San Diego, CA). 


2.3. Genotyping Quality Control 


Data on 240,117 SNPs and 35,842 individuals (16,289 males and 19,552 females) were available prior 
to implementation of quality control (QC) measures. No individuals were excluded for low genotyping 
efficiency (<98%). 6,599 SNPs were excluded for low genotyping efficiency (<98%) and 71,667 SNPs 
were monomorphic. Twenty-six individuals (14 EHR males and 12 EHR females) were excluded for 
inconsistent genetic and database sex. After QC, 163,135 SNPs remained for analyses in 35,456 
individuals. No SNPs were removed for deviations from Hardy-Weinberg equilibrium. 
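
The two SNP-level exclusions described above (genotyping efficiency below 98% and monomorphic 
markers) can be sketched as a simple filter. This is a generic illustration and not the 
pipeline used in the study: the genotype coding (0, 1, or 2 copies of one allele, with None for 
a missing call), the function names, and the marker identifiers are all hypothetical. 

```python
def snp_call_rate(genotypes):
    """Fraction of non-missing calls for one SNP (None marks a missing call)."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def is_monomorphic(genotypes):
    """True when every observed call is homozygous for the same allele."""
    observed = {g for g in genotypes if g is not None}
    return observed <= {0} or observed <= {2}


def filter_snps(snp_calls, min_call_rate=0.98):
    """Split SNP identifiers into kept, low-call-rate, and monomorphic lists."""
    kept, low_call, monomorphic = [], [], []
    for snp_id, genotypes in snp_calls.items():
        if snp_call_rate(genotypes) < min_call_rate:
            low_call.append(snp_id)
        elif is_monomorphic(genotypes):
            monomorphic.append(snp_id)
        else:
            kept.append(snp_id)
    return kept, low_call, monomorphic


# Hypothetical calls for 100 individuals at three markers.
snp_calls = {
    "snp_a": [0, 1, 2, 1] * 25,           # polymorphic, fully called
    "snp_b": [0] * 100,                   # monomorphic
    "snp_c": [0, 1] * 48 + [None] * 4,    # 96% call rate
}
kept, low_call, monomorphic = filter_snps(snp_calls)
```

In this sketch a SNP that fails the call-rate threshold is counted only once, even if it would 
also be monomorphic; how the study counted such overlaps is not stated. 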


2.4. Quantification and Statistical Analyses 


Descriptive statistics on demographic and clinical characteristics were expressed as means with 
standard deviation or median with interquartile range for continuous covariates and as frequencies or 
proportions for categorical data using SPSS statistical software (IBM Corporation, Armonk, NY) (Table 
1). 


Table 1. Summary of demographic characteristics of study individuals 


Race*           White              Black              Hispanic/Latino    Asian              Other/Unknown
N (%)           28,723 (80.1)      4,129 (11.5)       550 (1.5)          270 (0.75)         2,170 (2.8)
Male %          46.9%              38.9%              43.6%              42.5%              39.6%
Birth Year
  Mean (SD)     1957 (24.1)        1968 (26.0)        1976 (25.6)        1959 (19.8)        1955 (20.2)
  Median (IQR)  1951 (1938-1971)   1964 (1948-1995)   1977 (1957-2000)   1958 (1944-1972)   1953 (1940-1967)
  Range         1905-2012          1908-2011          1918-2012          1915-2010          1906-2012


*Non-overlapping categories 


A subset of 2,678 ancestry-informative markers (AIMs) was selected for subsequent analysis. We
chose AIMs on the ExomeChip that show strong allele frequency differences between African and
European ancestry populations as well as between Asian and European ancestry populations. AIMs were
used instead of pruned SNP data because of the particular composition of the ExomeChip platform,
which was designed with a panel of AIMs to enable evaluation of ancestry.


2.4.1. Principal component analysis


EIGENSTRAT v6.0.1 software was used to conduct principal component analysis (PCA) to estimate
continuous axes of ancestry from AIMs in all populations together. SPSS was used to create plots of
individuals stratified by birth year, which show how the demography of the cohort changes over time
(Fig. 1).

Fig. 1. Principal components plots of study individuals by decade of birth between 1905 and 2013.
Principal components were calculated in EIGENSTRAT and anchored to populations in 1000 Genomes.
Sample size information can be found in Table 1.


2.4.2. STRUCTURE analysis 


STRUCTURE software v2.3.3 was used to quantify ancestry in the combined study and 1000 Genomes
Project Phase 3 individuals using the AIMs. We estimated proportions of ancestry assuming ancestral
clusters (K) ranging from one to 16, where 16 is the number of sub-populations in the 1000 Genomes
Phase 3 data plus two. We assumed unlinked SNPs and used 5,000 iterations of burn-in and 10,000
iterations for analysis without providing population information to the software. We observed that the
negative log-likelihood of the data given K did not vary significantly for values of K greater than
three and that values of K greater than three primarily subdivided the European populations (data not
shown). The three STRUCTURE clusters corresponded to African, Asian, and European ancestry based on
comparisons to the 1000 Genomes reference data (data not shown).

Each of the three derived proportions of continental ancestry from STRUCTURE was regressed
onto birth year using generalized additive models with integrated smoothness estimation (GAM)
implemented in the R package mgcv in all study individuals [...]

Fig. 2. Proportions of ancestry derived by STRUCTURE analysis of AIMs for study individuals plotted
against birth year. Shaded regions [...]

[...] Hispanic White, Non-Hispanic Black, Hispanic/Latino, and Non-Hispanic Asian (Fig. 3).
It has been previously shown that when parental populations stop contributing to admixture,
the variance of admixture proportions decreases rapidly, and when parental populations continue to
contribute to the admixture, the variance of admixture proportions increases over time. To test the
null hypothesis that the observed levels of [...] increasing rates of [...] variance of [...] and birth
year. We used the software package MVtest and modeled the admixture proportion variance as a
log-linear function of birth year and five principal components of ancestry using estimating equations.

Fig. 3. Admixture proportion plotted against birth year. Admixture proportion is defined as 1 minus
the maximum ancestry proportion. The smoothing curves are obtained using the generalized additive
model method (gam) with a cubic spline basis implemented in R package mgcv and plotted using R
package ggplot2. Sample size information can be found in Table 1. Shaded regions represent 95%
confidence intervals. EHR-designated race categories are non-overlapping.
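
The caption of Fig. 3 defines the admixture proportion as 1 minus the maximum ancestry proportion. A minimal sketch of that definition follows; the function name and the input check are illustrative assumptions, and the inputs stand in for the per-individual ancestry fractions estimated by STRUCTURE.

```python
def admixture_proportion(ancestry_fractions, tol=1e-6):
    """Return 1 minus the largest ancestry fraction for one individual.

    ancestry_fractions: estimated fractions (e.g. African, Asian, European),
    assumed here to sum to 1 within a small tolerance.
    """
    total = sum(ancestry_fractions)
    if abs(total - 1.0) > tol:
        raise ValueError("ancestry fractions should sum to 1")
    return 1.0 - max(ancestry_fractions)
```

Under this definition an individual with fractions (0.98, 0.01, 0.01) has an admixture proportion of about 0.02, while an individual with two equal parental contributions (0.5, 0.5, 0.0) has 0.5.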


2.5. Analysis of Admixture Proportion Variance Over Time


We modeled admixture proportion mean and variance simultaneously as functions of birth year and
covariates. Specifically, let $m_i$ be the mean and $\sigma_i^2$ be the variance of the trait $Y_i$ for
the i-th individual. We model them as

$$m_i = \beta_0 + \beta_g G_i + \sum_{j=1}^{p} \beta_j X_{ij} \tag{1}$$

and

$$\ln(\sigma_i^2) = \gamma_0 + \gamma_g G_i + \sum_{j=1}^{p} \gamma_j X_{ij} \tag{2}$$

where $G_i$ is the birth year or variable of interest for the i-th individual and $X_{i1}, \ldots, X_{ip}$
are p covariates. In this model the variance is monotonic with respect to birth year, an assumption that
holds in most circumstances. The parameters are estimated simultaneously. This framework allows for
testing of the null hypothesis of no effect on mean, variance, or both for any term in the model. These
correspond to a mean test with null $H_0: \beta_g = 0$, a variance test with $H_0: \gamma_g = 0$, both
having one degree of freedom (DF), and a 2-DF test with $H_0: \beta_g = 0, \gamma_g = 0$.


2.6. Model Fitting with Estimating Equations 


The parameters are estimated through the estimating equations approach, which does not require a full
specification of the outcome distribution, but only a few constraints for the parameters of interest. These
constraints are often written as equations, and the parameter estimates can be obtained by solving the
equations. The asymptotic distribution for the parameter estimates can be derived. Specifically,
suppose the random variable has mean $m_i = \beta_0 + \beta_g G_i + \sum_{j=1}^{p} \beta_j X_{ij}$ and
log-variance $\ln(\sigma_i^2) = \gamma_0 + \gamma_g G_i + \sum_{j=1}^{p} \gamma_j X_{ij}$. There are
$k = 2(p + 2)$ parameters, which can be written as a vector, $\theta = (\beta, \gamma)$, where
$\beta = (\beta_0, \beta_g, \beta_1, \ldots, \beta_p)$ and $\gamma = (\gamma_0, \gamma_g, \gamma_1, \ldots, \gamma_p)$.
Let $y_i$ and $x_i = (1, g_i, x_{i1}, \ldots, x_{ip})^T$ be the observed values for subject i. If we had
assumed normality for the outcome, the log-likelihood for observation i would have been

$$l_i = -\frac{1}{2}\ln(2\pi) - \frac{1}{2}\gamma^T x_i - \frac{(y_i - \beta^T x_i)^2}{2\exp(\gamma^T x_i)}, \tag{3}$$

for which the vector of partial derivatives with respect to the parameters $\theta$ is a k-vector,

$$\Psi_i(\theta) = \begin{pmatrix} \partial l_i / \partial \beta \\ \partial l_i / \partial \gamma \end{pmatrix} = \begin{pmatrix} \dfrac{y_i - \beta^T x_i}{\exp(\gamma^T x_i)}\, x_i \\ \dfrac{1}{2}\left[\dfrac{(y_i - \beta^T x_i)^2}{\exp(\gamma^T x_i)} - 1\right] x_i \end{pmatrix}, \tag{4}$$

and maximum likelihood estimates of the parameters could have been obtained by solving the k
equations $\sum_{i=1}^{n} \Psi_i(\theta) = 0$. This motivated us to use these k equations,

$$\sum_{i=1}^{n} \Psi_i(\theta) = 0, \tag{5}$$
as the starting point for our estimating equations approach to obtain parameter estimates
$\hat{\theta} = (\hat{\beta}, \hat{\gamma})$. If normality holds, then $\hat{\theta}$ are the maximum
likelihood estimates. Note that although the estimating equations were motivated by the Gaussian
likelihood, one can always start from these equations to obtain $\hat{\beta}$ and $\hat{\gamma}$, whether
normality holds or not, and proceed with statistical inference using M-estimation theory. This is a
major advantage of using estimating equations. The partial derivative of $\Psi_i(\theta)$ is a
$k \times k$ matrix, denoted as $\dot{\Psi}_i(\theta)$. Using M-estimation theory, we have

$$\sqrt{n}\,(\hat{\theta} - \theta) \xrightarrow{d} N(0, V), \tag{6}$$

where the $k \times k$ covariance matrix V can be estimated as $A^{-1} B (A^{-1})^T$ with
$A = -\frac{1}{n}\sum_{i=1}^{n} \dot{\Psi}_i(\hat{\theta})$ and
$B = \frac{1}{n}\sum_{i=1}^{n} \Psi_i(\hat{\theta}) \Psi_i(\hat{\theta})^T$.


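If our interest is on the effect of G, the asymptotic result for the joint distribution of the parameter
estimates $\hat{\beta}_g$ and $\hat{\gamma}_g$ is

$$\sqrt{n}\left[\begin{pmatrix}\hat{\beta}_g \\ \hat{\gamma}_g\end{pmatrix} - \begin{pmatrix}\beta_g \\ \gamma_g\end{pmatrix}\right] \xrightarrow{d} N(0, V_g), \tag{7}$$

where $V_g$ is the corresponding $2 \times 2$ submatrix of V, with diagonal values denoted as
$\hat{\sigma}^2_{\beta_g}$ and $\hat{\sigma}^2_{\gamma_g}$, respectively. A mean test ($H_0: \beta_g = 0$)
can be performed by comparing $\sqrt{n}\,\hat{\beta}_g$ with $N(0, \hat{\sigma}^2_{\beta_g})$, and
similarly, a variance test ($H_0: \gamma_g = 0$) by comparing $\sqrt{n}\,\hat{\gamma}_g$ with
$N(0, \hat{\sigma}^2_{\gamma_g})$. A 2-DF joint test ($H_0: \beta_g = 0, \gamma_g = 0$) can be performed by
comparing $n(\hat{\beta}_g, \hat{\gamma}_g)\, \hat{V}_g^{-1} (\hat{\beta}_g, \hat{\gamma}_g)^T$ with a
chi-squared distribution with two degrees of freedom. MVtest software for genetic analysis of SNP data
or general analysis of variables is freely available at https://github.com/edwards-lab/MVtest.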
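
To make the estimating equations concrete, the following is a minimal sketch that solves equation (5) for the simplest case with no covariates (p = 0), so that the mean is b0 + bg*G and the log-variance is c0 + cg*G. It is not the MVtest implementation: the Fisher scoring iteration, the starting values, the function names, and the simulated data are all assumptions made for illustration, and the sandwich covariance estimate used for the tests in Section 2.6 is omitted.

```python
import math
import random


def solve2(a, b):
    """Solve the 2x2 linear system a z = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]


def fit_mean_variance(y, g, n_iter=100, tol=1e-10):
    """Solve sum_i Psi_i(theta) = 0 for m_i = b0 + bg*g_i, ln(var_i) = c0 + cg*g_i."""
    n = len(y)
    ybar = sum(y) / n
    beta = [ybar, 0.0]
    gamma = [math.log(sum((v - ybar) ** 2 for v in y) / n), 0.0]
    sg = sum(g)
    sgg = sum(v * v for v in g)
    # Expected information for gamma under normality: (1/2) * sum_i x_i x_i^T.
    info_c = [[0.5 * n, 0.5 * sg], [0.5 * sg, 0.5 * sgg]]
    for _ in range(n_iter):
        score_b, score_c = [0.0, 0.0], [0.0, 0.0]
        info_b = [[0.0, 0.0], [0.0, 0.0]]
        for yi, gi in zip(y, g):
            var = math.exp(gamma[0] + gamma[1] * gi)
            resid = yi - beta[0] - beta[1] * gi
            u = 0.5 * (resid * resid / var - 1.0)
            # Accumulate the mean and variance halves of Psi_i(theta).
            score_b[0] += resid / var
            score_b[1] += resid / var * gi
            score_c[0] += u
            score_c[1] += u * gi
            info_b[0][0] += 1.0 / var
            info_b[0][1] += gi / var
            info_b[1][0] += gi / var
            info_b[1][1] += gi * gi / var
        step_b = solve2(info_b, score_b)
        step_c = solve2(info_c, score_c)
        beta = [beta[0] + step_b[0], beta[1] + step_b[1]]
        gamma = [gamma[0] + step_c[0], gamma[1] + step_c[1]]
        if max(abs(s) for s in step_b + step_c) < tol:
            break
    return beta, gamma


def simulate(n, beta, gamma, seed=0):
    """Draw toy data with G uniform on [-1, 1] (a rescaled stand-in for birth year)."""
    rng = random.Random(seed)
    g = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    y = [rng.gauss(beta[0] + beta[1] * gi, math.sqrt(math.exp(gamma[0] + gamma[1] * gi)))
         for gi in g]
    return y, g
```

Calling fit_mean_variance on data from simulate(4000, [0.1, 0.05], [-2.0, 0.5]) should return estimates close to the generating values. A positive estimate of the variance slope corresponds to variance that grows with G, which is the pattern the variance test is designed to detect.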


2.7. Heterozygosity Analysis 


Standardized measures of heterozygosity among the AIMs were calculated to evaluate trends in
heterozygosity over time relative to expectations. We first estimated the expected number of
heterozygous genotypes in an individual in the k-th subpopulation as

$$H_k = \sum_i 2 p_{ik} q_{ik} \tag{8}$$

where the sum is over all SNPs in our analysis, and $p_{ik}$ and $q_{ik} = 1 - p_{ik}$ are the allele
frequencies for the i-th SNP. Hardy-Weinberg equilibrium was assumed. Then for every individual j in
the k-th subpopulation, we standardized the observed number of heterozygous genotypes, $O_{kj}$, by
comparing it with the expected number $H_k$:

$$(O_{kj} - H_k)/H_k \tag{9}$$

Standardized heterozygosity was regressed onto birth year using GAM for all study individuals, and
the results were plotted for Non-Hispanic White, Non-Hispanic Black, Hispanic/Latino, and
Non-Hispanic Asian.
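
Equations (8) and (9) can be written out directly. The sketch below assumes complete genotype data coded as 0/1/2 allele counts and subpopulation allele frequencies supplied as a list; the function names are hypothetical and missing genotypes are not handled.

```python
def expected_heterozygosity(allele_freqs):
    """H_k = sum_i 2 * p_ik * q_ik under Hardy-Weinberg equilibrium (equation 8)."""
    return sum(2.0 * p * (1.0 - p) for p in allele_freqs)


def standardized_heterozygosity(genotypes, allele_freqs):
    """(O_kj - H_k) / H_k for one individual (equation 9).

    genotypes: allele counts (0, 1, or 2) at each SNP; a value of 1 is a
    heterozygous genotype. allele_freqs: frequencies p_ik at the same SNPs
    in the individual's subpopulation.
    """
    observed = sum(1 for g in genotypes if g == 1)
    expected = expected_heterozygosity(allele_freqs)
    return (observed - expected) / expected
```

For four SNPs that each have allele frequency 0.5, the expected count is 2.0, so an individual who is heterozygous at three of them has a standardized heterozygosity of 0.5. Positive values indicate more heterozygous genotypes than expected for the subpopulation.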


2.8. Analysis of Long-Range Linkage Disequilibrium 


To evaluate the presence of admixture long-range linkage disequilibrium (LRLD), pairwise linkage
disequilibrium (LD) D’ statistics were calculated for all pairs of common (MAF > 0.05) SNPs within
10 megabases (Mb) using Haploview software. D’ statistics were regressed onto physical distance
between SNPs using generalized additive models with integrated smoothness estimation for distances
in the interval from 9-10 Mb for each birth decade (Fig. 4).
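
For readers unfamiliar with D’, the sketch below computes it for one pair of biallelic SNPs. It assumes phased haplotypes coded as (a, b) pairs of 0/1 alleles, which is a simplification: Haploview works from unphased genotypes and estimates haplotype frequencies itself, so this is an illustration of the statistic and not of the software.

```python
def d_prime(haplotypes):
    """Absolute D' for two biallelic SNPs from phased haplotypes.

    haplotypes: list of (a, b) tuples, each allele coded 0 or 1.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        raise ValueError("D' is undefined when either SNP is monomorphic")
    return abs(d) / d_max
```

A value of 1 means that at most three of the four possible haplotypes are observed, and a value near 0 means the two SNPs are close to independent. Elevated D’ between SNPs 9-10 Mb apart is unusual in a randomly mating population, which is why it serves here as a marker of recent admixture.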


2.9. United States Census Data Analysis 


We downloaded the 1% representative sample of individual-level responses to the American Community
Survey from the Integrated Public Use Microdata Series (IPUMS) (IPUMS USA, Minneapolis, MN).
We regressed the number of major race groups claimed by individuals onto their reported birth year
using generalized additive models with integrated smoothness estimation and frequency weights
provided by IPUMS for TN, the East South Central census region, and the entire U.S. (Fig. 5). For
individual groups, such as “White” and “Black or African American”, we plotted all individuals who
responded affirmatively to those items; the samples for the individual race group plots in Fig. 5 are
therefore not independent and overlap at observations where participants claim two or more race groups.

Fig. 4. Pairwise D’ for SNPs between 9-10 Mb for all common SNPs on the exome array in all study
individuals stratified by intervals of birth decade. Shaded regions represent 95% confidence intervals.


2.10. Phenotype Classification 


Each individual was classified according to 1,645 phenotypes based on the International Classification
of Diseases, Ninth Revision, Clinical Modification (ICD9) codes. Our classification strategy includes
all ICD9 features except for procedures. Additionally, the system is hierarchical such that disease
subtypes are also classified; for example, cardiac arrhythmias is the parent of atrial fibrillation and
atrial flutter. Additional phenotypes that are not represented directly in the ICD9 hierarchy are also
included, such as inflammatory bowel disease as the parent of Crohn’s disease and ulcerative colitis.
Diagnoses that were not possible for an individual were set to missing, such as pregnancy for biological
males or prostate disease for biological females. Detailed features of all phenotype algorithms used
are available from http://phewascatalog.org.


2.11. Clinical Outcomes 


For each phenotype, we regressed the binary outcome onto the standardized heterozygosity from 
equation 9 above, adjusted for birth year and the top 5 principal components of ancestry using logistic 
regression. We limited analysis to outcomes with 40 or more cases and individuals with at least 2 ICD9 
codes. We determined the threshold for statistical significance by Bonferroni correction for the number 
of analyses where the model converged. 


3. Results 


We evaluated 2,678 ancestry informative markers (AIMs) from genetic data in Vanderbilt University’s 
BioVU. These AIMs were from ExomeChip data in a cohort of 28,723 White, 4,129 Black, 550 
Hispanic/Latino, and 270 Asian individuals, based on EHR-third party race designation. The 
demographic characteristics of study individuals are presented in Table 1. 


3.1. Analysis of Temporal Trends in Genetic Admixture 


After combining our data with the 1000 Genomes as a reference group, we calculated principal
components to identify patterns of ancestry in each individual. We used the ancestral classifications to
test for temporal trends in mean and variance in admixture proportion.
Analysis of temporal trends in genetic admixture showed an increase in ancestral diversity over time.
Plots of the first two principal components demonstrated a distinct pattern change in younger
generations. To assess the trend of increasing admixture in younger individuals, we calculated the
admixture proportion, defined as (1 - predominant fraction of ancestry), for each EHR-designated race
(Fig. 3). The admixture proportion consistently increased with [...] proportion in those born in the
early- to mid-20th century, but an increase after the 1980s. In all age groups, each recorded race had a
small number of individuals who plotted outside of the expected clusters. In later birth years, an
increased number of individuals midway between the European and African clusters appears, creating a
new cluster representing Black-White biracial children (Fig. 1). Stratifying these plots by
EHR-designated race revealed that individuals in this ancestral cluster identify as both White and Black
(data not shown). In addition to the clear biracial cluster apparent in the principal component plots,
the overall proportions of ancestry across all recorded race groups exhibit increasing admixture in
younger individuals. Additionally, a significant increase in the variance of admixture proportions over
time was observed (variance coefficient = 0.0193 ± 0.0009 [SE], p-value < 1.44x10^-71), indicating that
the log-variance of the admixture proportion increases linearly by 0.0193 with every birth year. This
finding is consistent with recent and ongoing admixture.

We detected similar patterns in three additional sources. First, we compared the rate of [...]
comparing this estimate with the expected heterozygosity for the EHR-designated race of the individual
as described in the methods. Analysis of the standardized heterozygosity by birth year and stated race
strongly supports the finding of increasing ancestral diversity in younger individuals (Fig. 6). The
timing of inflection for increased standardized heterozygosity varied between race groups, but the data
indicated that ancestral diversity has accelerated rapidly in Asian, non-Hispanic Black, and
non-Hispanic White cohorts since approximately 1980, while Hispanic/Latino groups have exhibited a
relatively steady rate of increasing diversity since the 1940s. This finding reflects the increasing
number of children born to biological parents of predominantly different ancestral backgrounds over the
past few decades.

Second, because recent admixture leads to increased LRLD, we verified patterns by estimating
pairwise LRLD in our dataset. Using common single nucleotide polymorphisms (SNPs) (minor allele
frequency [MAF] > 5%) from the genotyping array, we calculated pairwise D’ using Haploview and
plotted it against physical distance for all pairs of SNPs within 9-10 megabases of each other (Fig. 4).
These results show a small drop in LRLD in individuals born before the 1950s, followed by significant
steady increases in the 1950s through the 1980s and fluctuation at higher levels in the 1990s to 2010s.
This finding is consistent with the results of the admixture proportion analysis (Fig. 3), where admixture
proportions decreased from the 1910s to the 1940s, and then steadily increased thereafter.


Fig. 6. Standardized heterozygosity plotted against birth year for non- 
overlapping EHR-designated race groups. Sample size information can 
be found in Table 1. Shaded regions represent 95% confidence intervals. 

Third, we estimated changes in the number of races indicated in self-reported ancestry in the 2013
American Community Survey. Respondents were instructed to select Hispanic/Latino/Spanish status
and all applicable races for each individual in the household. We analyzed the average number of race
categories selected for each individual by birth year and stratified these results within race categories
for Tennessee, the East South Central region (which includes Tennessee, Alabama, Kentucky, and
Mississippi), and the entire U.S. (Fig. 5). The results suggest that younger individuals are more likely
to indicate multiple races. The inflection point appears to have been earlier in the Asian race category
but is demonstrated in all race groups by the mid-1960s in the entire U.S. sample. These data mirror the
findings of our cohort genetic analyses.


3.2. Changes in Health Diagnoses with Admixture 


To investigate the possible impact of increased ancestral complexity on human health and disparities,
we evaluated the association between individual heterozygosity and disease diagnoses using a
phenome-wide association study (PheWAS) approach. Increasing genetic admixture was associated with
fewer diagnoses of female reproductive traits across all data (Table 2). These results remain
statistically significant after correction for multiple tests. Phenotype codes for “disorders of
menstruation and other abnormal bleeding from female genital tract” and “irregular menstrual
cycle/bleeding” were significantly associated with protection by increasing heterozygosity (p-value =
7.21x10^-6 and 4.37x10^-5, respectively; Table 2). Other protective findings were also gynecological in
nature, including cervical cancer/dysplasia and abnormal Papanicolaou smear results. Significant
phenotypes in adults were predominantly detected for biological females. Outside of genitourinary
findings, other nominally significant associations (Bonferroni threshold < p < 0.05) show increased
risk with genetic admixture and include atopic dermatitis, AV block, obstructive asthma, and Sicca
syndrome.


Table 2. Results from the phenome-wide association study of heterozygosity and clinical outcomes for
full sample.


PheCode  Phenotype                                       P-value         OR (95% Confidence Interval)
629      Disorders of menstruation and other abnormal    7.21x10^-6      [...]
         bleeding from female genital tract
626.1    Irregular menstrual cycle/bleeding              4.37x10^-5      0.37 (0.23 - 0.60)
939      Atopic/contact dermatitis due to other or       [...]           1.82 (1.32 - 2.52)
         unspecified
180      Cervical cancer and dysplasia                   4.02x10^[...]   0.19 (0.08 - 0.48)
792.1    Papanicolaou smear of cervix or vagina with     5.36x10^[...]   0.21 (0.09 - 0.51)
         atypical squamous cells
180.3    Cervical intraepithelial neoplasia [CIN]        6.24x10^[...]   0.15 (0.05 - 0.45)
         [Cervical dysplasia]
426.2    Atrioventricular [AV] block                     7.62x10^[...]   2.97 (1.58 - 5.61)
495.11   Chronic obstructive asthma with exacerbation    8.13x10^[...]   4.63 (1.89 - 11.34)


4. Discussion


Mitigating racial disparities in health is a significant challenge for precision medicine. Some of these 
population-level health differences may be caused by phenotypic variability associated with ancestral 
genetic backgrounds. Admixture introduces additional complexity to genetic studies of health 
disparities and the effects of historical, ongoing, and increasing admixture on population-level health 
are not well understood. This study evaluates the level of admixture over time and the relationship 
between ancestral diversity and population health from a clinical perspective. 

In our Southeastern United States cohort of 35,842 individuals, we found that for individuals with 
an EHR race designation of White, the mean proportion of European ancestry decreased from 98% to 
92% after the 1990s as the proportion of African ancestry increased to 6%. For individuals designated 
as Black in the EHR, the mean African proportion decreased by 3% after the 1990s. The European 
ancestry proportion in the EHR-designated Hispanic/Latino group decreased by 15% after the 1980s 
(data not shown). Comparing these changes to historical socio-cultural shifts in our cohort’s geographic 
region provides context for these results. In the Southeastern U.S., laws and policies enforced 
segregation of populations of European and African ancestry. Consistent with these socio-cultural 
boundaries, there is little change in admixture through the 1960s. Additionally, despite legal rulings 
and socio-cultural transformation, there remained a very slow increase in admixture and heterozygosity 
for an additional 20-30 years, followed by a sharp increase over the next few decades. 

Our results, qualitatively mirrored in the 2013 American Community Survey, show that younger
individuals are more likely to have greater ancestral diversity than older age groups. The results of our
long-range linkage disequilibrium (LRLD) analyses support this notion, with LRLD consistently
increasing for individuals born between 1990 and 2010. It is important to note, however, that other
possible sources of LRLD (e.g. drift, epistatic selection) cannot necessarily be ruled out, although they
seem unlikely given the recent nature of admixture and formation of LRLD. This increase in ancestral
heterogeneity of the younger population may also lend itself to more powerful admixture mapping
projects in populations not traditionally considered for these types of studies.

Further, we show that changes in population genetic parameters have important consequences for
individual and population-level health. Several statistically significant (p < 5x10^-5) associations of
genetic diversity with adult female genitourinary diagnosis codes (e.g. irregular menstrual
cycle/bleeding, cervical cancer/dysplasia) were observed. These novel findings linking admixture to 
protection from menstruation and gynecological abnormalities suggest that ancestral diversity may 
decrease risk of disorders that could affect reproduction. Further, the changes in reproductive diagnoses 
were detected predominantly for biological females, suggesting a potential sex-specific population 
clinical response to changes in admixture. However, the sex-specific response we detected could also 
be a result of differences in treatment for reproductive health. For example, male reproductive traits, 
such as sperm quality, may not be routinely checked and reported as with female reproductive 
parameters. 

Other patterns emerge when considering nominally significant (p < 0.05) PheWAS results. Several
of these diagnoses show increased risk with increasing genetic diversity. Importantly, each of these
diseases has at least suggestive links to autoimmune dysregulation, including atopic dermatitis, AV
block, asthma, and Sicca/Sjögren syndrome. These patterns suggest a connection between increased
heterozygosity and increased activity in the immune system. Because our results show continued
increase in genetic admixture over time, it is possible that there will be increases in prevalence of these
types of diseases with time as well. Future research should address these immunity-disease relationships
with respect to admixture to determine the validity and consistency of these patterns.

The present study has several limitations that warrant consideration. First, the use of EHR data may 
have high levels of missingness and can introduce inherent selection bias due to patients seeking care 
at tertiary care centers. Furthermore, given the constraints imposed by our limited sample size and the 
unavailability of comprehensive reference data for Hispanic/Latino and Native American populations, 
we were unable to estimate Native American ancestry in this study. Therefore, to provide more robust 
insights into individuals who identify as Hispanic/Latino and/or Native American, it is necessary to 
independently validate these results using larger datasets with more diverse reference data. 

The concept of race was utilized in this study to reflect demographic dynamics in our cohort’s 
geographic region and to investigate changes to admixture and heterozygosity within these groups. 
Although the concept of race is a construct with social underpinnings and has limited biological
meaning, race is often captured in the clinical setting and is the basis for some clinical decision
making. It is important to consider the changing implications of classifying individuals by race given
the trend of increasing genetic diversity observed in this work and others. As prevalence of many
diseases and some drug efficacies vary by race, understanding race-associated factors in patients with 
complex ancestries may be increasingly important for effective delivery of precision medical care. 


Abstract


There is a desire in research to move away from the concept of race as a clinical factor because it 
is a societal construct used as an imprecise proxy for geographic ancestry. In this study, we leverage 
the biobank from Vanderbilt University Medical Center, BioVU, to investigate relationships 
between genetic ancestry proportion and the clinical phenome. For all samples in BioVU, we 
calculated six ancestry proportions based on 1000 Genomes references: eastern African (EAFR), 
western African (WAFR), northern European (NEUR), southern European (SEUR), eastern Asian 
(EAS), and southern Asian (SAS). From PheWAS, we found the phecode category neoplasms
significantly enriched for EAFR, WAFR, and SEUR, and pregnancy complications for SEUR, NEUR,
SAS, and EAS (p < 0.003). We then selected the phenotypes hypertension (HTN) and atrial fibrillation
(AFib) to further investigate the relationships between these phenotypes and EAFR, WAFR, SEUR, 
and NEUR using logistic regression modeling and non-linear restricted cubic spline modeling 
(RCS). For EAS and SAS, we chose renal failure (RF) for further modeling. The relationships 
between HTN and AFib and the ancestries EAFR, WAFR, and SEUR were best fit by the linear 
model (beta p < 1x10* for all) while the relationships with NEUR were best fit with RCS (HTN 
ANOVA p = 0.001, AFib ANOVA p< 1x10%). For RF, the relationship with SAS was best fit with 
a linear model (beta p < 1x10“) while RCS model was a better fit for EAS (ANOVA p < 1x10). 
In this study, we identify relationships between genetic ancestry and phenotypes that are best fit 
with non-linear modeling techniques. The assumption of linearity for regression modeling is 
integral for proper fitting of a model and there is no knowing a priori to modeling if the relationship 
is truly linear. 


Keywords: genetic ancestry, health disparities, PheWAS, linear modeling 


1. Introduction


Race is a social construct and an imprecise way to classify groups by prevalence of heritable risk
factors; there is therefore a growing consensus in clinical and population research to move away
from the use of race in the context of disease risk. Some racial disparities in health condition risks
documented in the epidemiological literature may be due to non-biological differences between
racial groups. Geographic or genetic ancestry has been proposed as a more precise approach to
capture differences in disease etiology that may be due to acquired biological differences in human
populations. We hypothesize that when populations that have evolutionarily adapted to a specific
environment encounter different circumstances, disease risks can be influenced, and disparities can
arise when compared to a population that is in evolutionary equilibrium with that environment. If
this hypothesis is true, then this relationship would be detectable as an association between 
genetically inferred proportions of ancestry and disease risk. Improved understanding of how 
different geographic ancestries are responding to modern environments, nutrition, and behavioral 
lifestyles could help us understand genetic causes of diseases and improve healthcare. 

Current approaches to precision medicine focus on a patient’s clinical history and are often 
combined with known genetic risk factors, such as causal monogenic variants and more recently 
polygenic risk scores. Over the last several decades, race has been incorporated into clinical risk 
prediction models for several conditions when racial differences have been observed in disease 
prevalence, particularly for estimating drug responses. Race has also been used for medical tools 
such as calibrating eGFR measures for assessment of kidney disease risk. However, multiple studies 
have shown that administratively determined race or self-reported race are imprecise estimates of
an individual’s genetic ancestry, and thus use of race in modeling is a flawed approach. Imprecise
racial/ancestral identification may lead to lack of response to a personalized treatment plan that
depends on a strong assumption of race capturing biological differences. Furthermore, recent work
by several groups has shown that for some diseases genetic ancestry (global ancestry) may directly
interact with a patient’s clinical characteristics to modify risk for disease and that this interaction
varies at specific points in their genome (local ancestry).

Within this study we leverage the rich phenotypic information available from Vanderbilt 
University Medical Center’s (VUMC) biobank, BioVU, to evaluate the relationship between global 
geographic ancestry and the clinical phenome using phenome wide association study (PheWAS). 
From PheWAS results, we sought to identify enriched phenotype categories for ancestry groups and 
selected phenotypes within them for additional modeling. Selected phenotypes were then modeled 
using logistic regression and restricted cubic splines (RCS) to further investigate the relationship 
between phenotype and ancestry group. Studies usually make the strong assumption that the 
relationship between genetic ancestry and disease risk is linear. We chose to explore if fitting a non- 
linear model better described the relationship. 


2. Methods 


2.1. Study Population 


The BioVU DNA Repository is a de-identified database of electronic health records (EHR) that are 
linked to patient DNA samples at VUMC. A detailed description of the database and how it is 
maintained has been published elsewhere. BioVU participant DNA samples were genotyped on a 
custom Illumina Multi-Ethnic Genotyping Array (MEGA-ex; Illumina Inc., San Diego, CA, USA). 
Quality control excluded samples or variants with missingness rates above 2%, as well as samples
for which consent had been revoked, that were duplicated, or that failed sex concordance checks.
Imputation was performed on the Michigan Imputation Server v1.2.410 using Minimac4 and the
Haplotype Reference Consortium (HRC) panel v1.1.


2.2. Ancestry Estimations of BioVU Participants 


Estimation of ancestry proportion for BioVU participants based upon 1000 Genomes reference data
has been described elsewhere. In brief, the 1000 Genomes populations were grouped into six
super-populations by geographic ancestry of east African (EAFR), west African (WAFR), southern
European (SEUR), northern European (NEUR), east Asian (EAS), and south Asian (SAS) as
described in Keaton et al. 2021 using ADMIXTURE. The six ancestry groups were projected
onto BioVU to determine the proportions of the six ancestries for all samples. Ancestry proportion of
samples in the cohort was visualized by plotting subjects along the x-axis and their corresponding
stacked ancestry proportions on the y-axis. Subjects were sorted by increasing SEUR ancestry.


2.3. Ancestry Phenome Wide Association Study 


We conducted hypothesis-free PheWAS analyses evaluating the association of each phecode in the 
phenome with each of the six ancestries. Each ancestry was used as the main predictor in a separate 
analysis, adjusted for 
age, sex, and body mass index (BMI). PheWAS was performed with the R package ‘PheWAS’ 
version 2.'4 1,875 clinical disease phenotypes called phecodes from Phecode Map 1.2 were 
evaluated.¹⁵ A p-value of 2.7x10⁻⁵ was the threshold for significance to correct for multiple testing 
(Bonferroni correction of 0.05/1,875 phecodes tested). 
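
The threshold arithmetic can be checked directly. The helper name `bonferroni_threshold` is introduced here only for illustration.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests

# 0.05 family-wise alpha across the 1,875 phecodes tested
phewas_threshold = bonferroni_threshold(0.05, 1875)
print(f"{phewas_threshold:.1e}")  # 2.7e-05
```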


2.3.1. Hypergeometric Testing of Enrichment 


Post PheWAS, phecodes were mapped to phenotypes and the phenotypes were grouped into the sixteen 
categories from the phecode map. We then conducted hypergeometric testing for enrichment of 
each phecode category within each ancestry PheWAS result. The hypergeometric distribution 
function HYPGEOM.DIST from Excel was used to calculate fold change and significance level for 
each category. The threshold for significance was 0.003 to correct for multiple testing (Bonferroni 
correction of 0.05/16 phecode groups tested). Hypergeometric testing results were visualized by 
plotting the -log(p-value) of enrichment for each category as a function of fold change. The phecode 
categories pregnancy complications and neoplasms were visualized by graphing each phecode in the 
categories by -log(p-value) as a function of effect size. Plots were made with R 4.2.2.¹⁶ 
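
A hypergeometric enrichment test of this kind can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' spreadsheet: the function names are hypothetical, the upper tail is used when a category is over-represented and the lower tail otherwise, and the signed fold change (negative reciprocal for de-enrichment) is one plausible reading of the negative values reported in Table 2.

```python
from math import comb

def hypergeom_pmf(k, n, K, N):
    """P(X = k): k category hits among n significant phecodes, when K of the
    N tested phecodes belong to the category."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def enrichment(k, n, K, N):
    """Return (signed fold change, tail p-value) for one phecode category.

    Assumed convention: ratio >= 1 is reported as-is with P(X >= k);
    ratio < 1 is reported as -1/ratio with P(X <= k).
    """
    lowest = max(0, n - (N - K))   # smallest feasible number of hits
    highest = min(n, K)            # largest feasible number of hits
    ratio = k / (n * K / N)        # observed hits over expected hits
    if ratio >= 1:
        p = sum(hypergeom_pmf(i, n, K, N) for i in range(k, highest + 1))
        return ratio, p
    p = sum(hypergeom_pmf(i, n, K, N) for i in range(lowest, k + 1))
    signed = -1 / ratio if k > 0 else float("-inf")
    return signed, p
```

For example, with N = 1,875 phecodes tested and n = 404 significant phecodes for an ancestry, `enrichment(k, 404, K, 1875)` evaluates a category containing K phecodes, k of which were significant.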


2.3.2. Selection of Phecodes for Modeling 


In PheWAS results, we looked for phecodes whose associations differed between EAS and SAS, 
and between the African (EAFR, WAFR) and European (NEUR, SEUR) ancestries. Renal failure (RF) 
was selected for further modeling in EAS and SAS. The pre-made phecode categories do not always 
capture all codes relevant to a given system. To focus on the cardiac system, we extracted 
phenotypes using the key terms 
“hypertens”, “heart”, “card”, “valv”, “fibril”, “coronary”, and “angina.” After manual review, we 
excluded codes pertaining to “poisoning by agents primarily affecting the cardiovascular system” 
and “heartburn”. Selected cardiac phecodes were visualized by plotting the -log(p-value) of the 
phecodes as a function of effect size using R 4.2.2. From this cardiac systems plot, we selected 
the phenotypes hypertension (HTN) and atrial fibrillation (AFib) for further modeling with EAFR, 
WAFR, NEUR, and SEUR. 
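
The keyword extraction step can be sketched as a substring filter over phecode descriptions. This is a minimal sketch: the function name and the dictionary layout are hypothetical, and the exclusions that the authors applied by manual review are approximated here by fixed strings.

```python
KEY_TERMS = ("hypertens", "heart", "card", "valv", "fibril", "coronary", "angina")
# Stand-in for the manual review described in the text (an assumption).
EXCLUDED = (
    "poisoning by agents primarily affecting the cardiovascular system",
    "heartburn",
)

def select_cardiac(phecodes):
    """Keep phecodes whose description contains a key term and no excluded phrase.

    `phecodes` maps a phecode string to its free-text description.
    """
    selected = {}
    for code, description in phecodes.items():
        text = description.lower()
        has_term = any(term in text for term in KEY_TERMS)
        is_excluded = any(phrase in text for phrase in EXCLUDED)
        if has_term and not is_excluded:
            selected[code] = description
    return selected
```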


2.4. Logistic Regression Modeling of Select Phecodes 


Selected phenotypes were modeled with logistic regression and RCS using the R package “rms” 
version 6.2-0.¹⁷ Each ancestry was used as the main predictor in separate models. Phenotypes were 
modeled as a function of ancestry proportion (ANC) using (Eq. 1) for logistic regression. 


$$\operatorname{logit} P\{Y = 1 \mid X\} = \beta_0 + \beta_{ANC} X_{ANC} + \beta_{age} X_{age} + \beta_{sex} X_{sex} + \beta_{BMI} X_{BMI} \tag{1}$$


Odds ratios (OR) and confidence intervals (CI) calculated for each ancestry from logistic 
regression are given for a 10% increase in ancestry proportion. Phenotypes were modeled as a 
function of ancestry proportion using (Eq. 2) for RCS with three knots (a,b,c). 


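$$\operatorname{logit} P\{Y = 1 \mid X\} = \beta_0 + \beta_{ANC} X_{ANC} + \beta_{age} X_{age} + \beta_{sex} X_{sex} + \beta_{BMI} X_{BMI} + \beta_a (X_{ANC} - a)_+^3 + \beta_b (X_{ANC} - b)_+^3 + \beta_c (X_{ANC} - c)_+^3 \tag{2}$$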
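
For three knots, the restriction that the spline be linear beyond the outer knots leaves a single free non-linear term, so the three cubic coefficients in Eq. 2 are tied to one another. The sketch below shows that term in the standard truncated-power form for restricted cubic splines; it is an illustration, not the "rms" implementation, the function names are hypothetical, and software may additionally rescale the term.

```python
def pos_cube(u):
    """Truncated cubic (u)_+^3: u cubed when u is positive, otherwise zero."""
    return u ** 3 if u > 0 else 0.0

def rcs_nonlinear_term(x, knots):
    """Single non-linear basis term of a restricted cubic spline with three
    knots a < b < c. The weights make the cubic and quadratic parts cancel
    for x > c, so the fitted curve is linear in both tails."""
    a, b, c = knots
    return (pos_cube(x - a)
            - pos_cube(x - b) * (c - a) / (c - b)
            + pos_cube(x - c) * (b - a) / (c - b))

def rcs_linear_predictor(x, beta0, beta_lin, beta_nl, knots):
    """Log odds as a function of ancestry proportion x (covariates omitted)."""
    return beta0 + beta_lin * x + beta_nl * rcs_nonlinear_term(x, knots)
```

In terms of Eq. 2, this corresponds to taking the coefficient on the first cubic term as free and setting the other two to the weighted multiples shown in `rcs_nonlinear_term`.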


Knot positions were determined by default “rms” placement. Odds ratios for RCS were 
calculated using integrated “rms” functions for a quartile increase in ancestry from the 25th to 50th 
percentile and from the 50th to 75th percentile. The significance threshold for ANOVA tests of 
model improvement with RCS over linear was 0.004 (Bonferroni correction of 0.05/12 [six 
ancestries * two models]). 
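
The two kinds of odds ratio reported here follow from simple arithmetic on the log-odds scale. The sketch below is an illustration only and does not reproduce the "rms" functions; it assumes ancestry proportion is coded on a 0 to 1 scale (so a 10% increase is 0.10) and uses a Wald interval, and the function names are hypothetical.

```python
from math import exp

def odds_ratio_per_increase(beta, se, delta=0.10, z=1.96):
    """OR and Wald 95% CI for an increase of `delta` in a linear predictor
    term with coefficient `beta` and standard error `se`."""
    return (exp(beta * delta),
            exp((beta - z * se) * delta),
            exp((beta + z * se) * delta))

def odds_ratio_between(log_odds, x_from, x_to):
    """OR comparing two predictor values under any fitted log-odds curve,
    for example a spline evaluated at the 25th and 50th percentiles."""
    return exp(log_odds(x_to) - log_odds(x_from))
```

With a non-linear fit the second function gives different odds ratios for the 25th to 50th and 50th to 75th percentile steps, which is why the RCS results are reported per quartile step rather than per fixed increment.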


3. Results 


3.1. Genetic Ancestry of BioVU Participants 


There were 71,140 participants from BioVU, 59.06% of whom were female; the average age was 
54.09 (SD = 18.15) years and the average BMI was 29.03 (SD = 7.27). (Table 1) Ancestry proportions 
for all individuals in BioVU are visualized in Figure 1. Of the six ancestry proportions calculated, 
SEUR represented the largest proportion of genetic ancestry with a population 
average of 60.9%, followed by NEUR with 22.4%, EAFR with 7.07%, WAFR with 6.41%, EAS 
with 1.76%, and SAS with 1.40%. (Table 1) 


3.2. PheWAS Summarized with Hypergeometric Testing 


There were 404 phecodes significantly associated with EAFR, 396 with WAFR, 414 with SEUR, 
150 with NEUR, 68 with SAS, and 74 with EAS. (Table 2) Hypergeometric testing of phecode 
categories identified enriched and de-enriched categories of phecodes. (Figure 2A) EAFR, WAFR 
and SEUR were de-enriched for ‘injuries and poisonings’ and ‘musculoskeletal’ and enriched for 
‘neoplasms.’ 

Table 1. Population characteristics of the BioVU cohort. 

                   Mean (SD) or N (%)
Age (years)        54.08 (18.15)
BMI (kg/m²)        29.03 (7.27)
Sex (Females)      42016 (59%)

NEUR               22.43 (9.72)
SEUR               60.93 (23.29)
EAFR               7.07 (15.4)
WAFR               6.41 (14.09)
EAS                1.76 (9.6)
SAS                1.4 (6.05)

Kg: kilogram; m: meters 


Phecodes significant within neoplasms showed opposite directions of effect for NEUR and SEUR 
groups versus WAFR and EAFR groups. (Figure 2B) Codes representing skin cancer and other skin 
neoplasms increased in odds with increasing NEUR and SEUR ancestry proportion but decreased 
in odds with increasing WAFR and EAFR ancestry proportion. Conversely uterine leiomyoma had 
increased odds with increased EAFR and WAFR ancestry proportion and decreased odds with 
increased SEUR and NEUR ancestry proportion. (Figure 2B) EAFR was additionally enriched for 
‘genitourinary.’ ‘Pregnancy complications’ was enriched in NEUR, SEUR, EAS, and SAS. When 
investigated further, it was revealed the significant phecodes in the category were almost all in the 
decreased direction for NEUR and SEUR and increased direction for EAFR, WAFR, EAS, and 
SAS. (Figure 2C) 


Figure 1. Structure plot of the genetic ancestry makeup of BioVU participants. Subjects are aligned 
on the x-axis by proportion of SEUR; the y-axis shows ancestry proportion. NEUR: northern European; 
SEUR: southern European; EAFR: eastern African; WAFR: western African; EAS: eastern Asian; 
SAS: southern Asian ancestry. 


3.3. Modeling Ancestry Proportion 


We identified 103 phecodes that included cardiac keywords or phrases. The most significant phecodes 
were those representing hypertension and its consequences. Increasing EAFR and WAFR 
ancestry proportion increased odds for these phecodes, while increasing SEUR and NEUR ancestry 
proportion decreased odds for the conditions. Phecodes involving atrial fibrillation and related codes 
were significantly associated with the same ancestries, but in opposite directions: increasing NEUR 
and SEUR increased odds while EAFR and WAFR decreased odds. (Figure 3) 

Figure 2. Volcano plots of fold change from hypergeometric testing or ancestry coefficient from PheWAS 
plotted against the negative log transformed p-value for A) phecode categories, B) neoplasms, and C) 
pregnancy complications. Created with BioRender.com 

Figure 3. Volcano plot of selected phecodes related to the cardiac system. Coefficient of phecode from 
PheWAS is on the x-axis and the y-axis is negative log transformation of p-value. Created with 
BioRender.com 


Table 2. Significant results from hypergeometric testing of phecode categories for each ancestry. 
Positive values indicate an enrichment of significant phecodes within that category while negative 
values indicate de-enrichment. Significance level is 0.003. 

Ancestry   N Significant Codes   Category                   Fold Change   P-value
EAFR       404                   genitourinary              1.39          0.003
                                 injuries & poisonings      -2.4          2.24x10*
                                 musculoskeletal            -2.05         3.89x10*
                                 neoplasms                  1.51          0.001
WAFR       396                   injuries & poisonings      -2.35         3.35x10*
                                 musculoskeletal            -2.17         2.31x10*
                                 neoplasms                  1.57          4.13x10*
SEUR       414                   injuries & poisonings      -2.05         0.001
                                 musculoskeletal            -1.96         5.77x10*
                                 neoplasms                  1.58          2.83x10*
                                 pregnancy complications    2             5.06x10*
NEUR       150                   infectious diseases        1.87          6.38x107
                                 musculoskeletal            -3.83         4.17x10*
                                 pregnancy complications    2.2           1.27x107
SAS        68                    pregnancy complications    7.32          1.97x10°%
EAS        74                    digestive                  2.34          7.63x107
                                 mental disorders           3.56          1.57x107
                                 pregnancy complications    6.16          5.57x107


We then investigated phecodes 401 ‘hypertension’ (HTN) and 427.2 ‘atrial fibrillation’ (AFib) 
with modeling in EAFR, WAFR, NEUR, and SEUR. (Figure 4A) When modeled linearly, each 
ancestry was associated with HTN and AFib (p < 0.003). (Table 3) When HTN and AFib were 
modeled using RCS, the ANOVA test revealed that adding the complexity of non-linearity 
significantly improved the model for NEUR (p = 0.001 and p < 1x10⁻⁴, respectively) but not for EAFR, 
WAFR, or SEUR (p > 0.003). (Figure 4) Increasing ancestry proportion by 10% in the linear 
model gave an OR for HTN of 2.29 (95% CI: 2.11 - 2.48) for EAFR, 2.73 (95% CI: 2.48 - 3.01) for 
WAFR, 0.27 (95% CI: 0.22 - 0.33) for NEUR, and 0.73 (95% CI: 0.70 - 0.75) for SEUR, visualized 
in the top row of panels in Figure 4A. For AFib, a 10% increase in ancestry proportion yielded ORs of 
0.58 (95% CI: 0.49 - 0.68) for EAFR, 0.53 (95% CI: 0.44 - 0.63) for WAFR, 4.39 (95% CI: 3.07 - 
6.27) for NEUR, and 1.31 (95% CI: 1.22 - 1.40) for SEUR when modeled linearly, visualized 
in the third row of panels in Figure 4A. Only NEUR had significant ANOVA p-values for the RCS 
models in both HTN and AFib. In RCS modeling of HTN, increasing NEUR ancestry proportion from 
the 25th to 50th percentile gave an OR of 0.96 (95% CI: 0.94 - 0.98) and the 50th 
to 75th percentile increase gave an OR of 0.99 (95% CI: 0.98 - 1.01). In RCS modeling of AFib, 
an increase from the 25th to 50th percentile in NEUR ancestry yielded an OR of 1.02 (95% CI: 0.99 - 1.06) 
and the 50th to 75th percentile increase yielded an OR of 0.98 (95% CI: 0.96 - 1.01). (Table 3) RCS 
models for HTN and AFib are visualized in the second and fourth rows of panels in Figure 4A, 
respectively. 


Table 3. Results of logistic regression and restricted cubic spline modeling for hypertension and 
atrial fibrillation in northern European, southern European, west African, and east African ancestry; 
and renal failure in eastern Asian and southern Asian ancestry. 

                      Logistic Regression              Restricted Cubic Spline
                      OR* (95% CI)        P-value      OR† (95% CI)          OR‡ (95% CI)          ANOVA P-value
Atrial Fibrillation
  SEUR                1.31 (1.22-1.40)    <1x10⁻⁴      1.00 (0.98-1.03)      1.00 (0.98-1.02)      0.06
  NEUR                4.39 (3.07-6.27)    <1x10⁻⁴      1.02 (0.99-1.06)      0.98 (0.96-1.01)      <1x10⁻⁴
  EAFR                0.58 (0.49-0.68)    <1x10⁻⁴      0.99 (0.99-1.00)      0.97 (0.96-0.99)      0.01
  WAFR                0.53 (0.44-0.63)    <1x10⁻⁴      0.99 (0.99-1.00)      0.98 (0.96-0.99)      0.02
Hypertension
  SEUR                0.72 (0.70-0.75)    <1x10⁻⁴      0.98 (0.96-0.99)      0.99 (0.98-1.00)      0.25
  NEUR                0.27 (0.22-0.33)    <1x10⁻⁴      0.96 (0.94-0.98)      0.99 (0.98-1.01)      0.001
  EAFR                2.29 (2.11-2.48)    <1x10⁻⁴      1.00 (0.999-1.003)    1.003 (0.996-1.01)    0.30
  WAFR                2.73 (2.48-3.01)    <1x10⁻⁴      1.00 (0.998-1.002)    1.00 (0.99-1.01)      0.11
Renal Failure
  EAS                 0.96 (0.73-1.26)    0.78         1.09 (1.08-1.11)      1.18 (1.15-1.21)      <1x10⁻⁴
  SAS                 0.15 (0.06-0.37)    <1x10⁻⁴      1.00 (0.98-1.03)      1.00 (0.98-1.03)      0.41

*Odds ratio given for 10% increase of ancestry proportion 
†Odds ratio given for 25th to 50th percentile of ancestry proportion 
‡Odds ratio given for 50th to 75th percentile of ancestry proportion 


In PheWAS results, phecode 585 ‘renal failure’ showed different relationships with EAS and 
SAS ancestry proportion: RF was significantly associated with SAS, but not with EAS. (Table 3) 
When modeled linearly, SAS ancestry proportion was significantly associated with RF (p < 1x10⁻⁴), 
but adding non-linear complexity did not significantly improve the model (p = 0.41). (Figure 4B) 
EAS was not significantly associated with RF when modeled linearly (p = 0.78). Modeling with the 
non-linear RCS revealed a significant relationship between RF and EAS ancestry proportion (p < 
1x10⁻⁴). (Figure 4B) For SAS, a 10% increase in ancestry proportion had an OR of 0.15 (95% CI: 
0.06 - 0.37) when modeled linearly. In RCS modeling, increasing from the 25th to 50th percentile of 
EAS ancestry proportion gave an OR for RF of 1.09 (95% CI: 1.08 - 1.11) and increasing from 
the 50th to 75th percentile gave an OR of 1.18 (95% CI: 1.15 - 1.21). (Table 3) 

Figure 4. Linear modeling using logistic regression and restricted cubic spline (RCS) modeling of select 
phenotypes. A) Hypertension and atrial fibrillation risk models for SEUR, NEUR, EAFR, WAFR. B) 
Renal failure models for EAS and SAS. Log odds of outcome were graphed as a function of ancestry 
proportion adjusted for age, sex, and BMI. * = significant model. Created with BioRender.com. 


4. Discussion 


We present an evaluation of the relationships between genetic ancestry proportions and the clinical 
phenome of the BioVU cohort. Our analyses revealed significantly enriched and de-enriched 
phecode categories for each ancestry group studied. We further evaluated the relationship between 
genetic ancestry and risk for HTN, AFib, and RF using linear and non-linear modeling methods. 


4.1. Relationships Between Ancestry and the Clinical Phenome 


Phecode categories that were de-enriched for PheWAS associations were ‘injuries and poisonings’ 
(for EAFR, WAFR, and SEUR) and ‘musculoskeletal’ (for EAFR, WAFR, SEUR, and NEUR). 
Both categories represent codes for conditions that are not typically considered heritable. The 
‘injuries and poisonings’ category comprises codes related to non-pathologic fractures, trauma 
injuries, and poisonings, all events caused by environment. Phecodes in ‘musculoskeletal’ involve injuries or 
deformities of joints, bones, and muscles acquired from usage of the body. One specific phenotype 
to mention in this category is osteoporosis, where increasing NEUR and SEUR ancestry increased 
risk for codes relating to osteoporosis (phecodes 743, 743.1, 743.11) and spine curvature (737, 
737.3), while the same codes have a protective effect with increasing EAFR and WAFR ancestry. 
Studies have shown increased bone mineral density and lower rates of osteoporosis in 
Black women compared to non-Hispanic White women.¹⁸ Our genetic ancestry study findings 
support this previously observed epidemiological relationship. 

In the ‘neoplasms’ category, many of the phecodes were in the risk direction for SEUR and 
NEUR ancestries and in the protective direction for WAFR and EAFR. The top significant neoplasm 
codes refer to skin cancer and other neoplasms of skin. The biological relationship between 
geographic ancestry and skin cancer has been well documented; populations in equatorial regions 
produce more melanin to protect against DNA damage from UV radiation while populations out 
towards the poles have evolved to produce less melanin due to less UV exposure.¹⁹,²⁰ It is possible 
that individuals of European genetic ancestry migrated away from the environments where they 
adapted to be at equilibrium and are now in new environments they are at disequilibrium with.²¹ 

One of the few exceptions to the pattern seen in ‘neoplasms’ were the phecodes 218 and 218.1, 
representing ‘uterine leiomyoma’ (or fibroids). Increasing EAFR and WAFR ancestries increased 
odds for fibroids while increasing SEUR and NEUR ancestries was protective against fibroids. This 
relationship pattern is consistent with previous epidemiology literature. Black women have been 
found to develop fibroids at younger ages, were more likely to have a clinical diagnosis, and to have 
had a hysterectomy from fibroids.” The overall odds of developing fibroids by age 50 were 2.9 
times higher among Black women compared to White women.”? Due to the significant racial 
disparities that exist for fibroids, it has been hypothesized that there is a genetic component to the 
condition, with a heritability estimate of ~30%.”3 Previous genetic studies have found African 
genetic ancestry proportion to be associated with fibroids diagnosis!” and multiple fibroids.” Our 
study further supports the theory that African genetic ancestry may explain a portion of the risk for 
fibroids. 


The pregnancy complication category was significantly enriched in NEUR, SEUR, EAS, and 
SAS. Within the category, significantly associated phecodes were all in the protective direction for 
NEUR and SEUR and in the risk direction for EAS and SAS. Racial disparities in maternal health 
outcomes have been well documented for White and Black women, with Black women having 
significantly higher adverse maternal outcomes compared to White women.²⁵ There have been many 
external factors posited for why Black women in the US experience pregnancy complications and 
maternal mortality at much higher rates.²⁶ Trends in pregnancy complications for Asian women are 
less well-documented. A study of fertility treatment outcomes in Asian American women found 
decreased success of treatment in the forms of lower pregnancy rates and live births.²⁷ Using genetic 
ancestry proportions as a study variable may help to fill in some of the epidemiological gaps 
that are still pervasive for historically under-represented racial groups. 


4.2. Modeling Ancestry Proportion Linearly and non-Linearly 


From the phecodes grouped into the cardiac category, we saw a striking pattern. Several phecodes 
representing HTN and hypertensive disorders and consequences were found to be at increased risk 
in EAFR and WAFR and decreased risk in NEUR and SEUR. An opposite trend was seen for 
phecodes representing AFib and related codes; SEUR and NEUR were at increased risk while EAFR 
and WAFR were at decreased risk. This pattern follows what has been reported in literature.²⁸⁻³⁰ 
Our study shows the trends we see for HTN and AFib are due in part to genetic ancestry. 

While plenty of studies have focused on external causes and contributions to the higher 
prevalence of HTN in Black individuals,*! it is known to be heritable.** A small (N = 998), previous 
study evaluated the relationship between African genetic ancestry proportion in self-identified Black 
individuals and hypertension and found the highest quartile of African genetic ancestry proportion 
had 8% higher prevalence than the lowest quartile.’ Marden et al. used African genetic ancestry 
proportion to tease apart the contributions of genetics and socioeconomic status to HTN prevalence 
and found that their accounted socioeconomic factors only explained one-third of the difference in 
prevalence measured.*? We as well sought to use genetic ancestry to determine its contribution to 
HTN disease risk as it helps to avoid confounders. The previous study and ours have both found 
African genetic ancestry to be associated with HTN risk and prevalence. 

Within our evaluation of RF, we found linear modeling to be sufficient to model the relationship 
with SAS ancestry. EAS ancestry was not significantly associated with RF in PheWAS or when 
modeled individually linearly. Allowing for flexibility with non-linear RCS modeling revealed a 
relationship between EAS ancestry and RF. Only with the RCS model were we able to detect an OR 
of 1.18 (95% CI: 1.15-1.21) with an increase from the 50th to 75th percentile of EAS ancestry. EAS was 
the ancestry group with the most skewed data density of the six groups, with the 3rd quartile ancestry 
proportion value being just 0.45% and one of our smallest sub-sample sizes with 760 self-identified 
individuals. The RCS model may have performed better because it was able to compensate for the 
skewness of the data. Many wide-scale analyses perform only linear modeling, which may not detect 
relationships, as seen for RF in the EAS ancestry PheWAS. The risk trends for EAS and RF from RCS 
modeling have been reported previously in literature. Higher rates of end stage renal disease and 
increased risk of projected kidney failure have been observed in Far East, Southeast Asia, and Indian 
populations as compared to White populations.³⁴,³⁵ The linear model for SAS and RCS model for 
EAS recapitulate these findings. Assuming linear relationships between genetics and disease may 
cause associations to be missed, highlighting the need to consider non-linear modeling methods such 
as RCS. 


4.3. Considerations and Strengths 


While our study found some phenotype relationships that were consistent with epidemiology studies 
based on self-identified race, we did not evaluate the potential contribution of proportions of 
admixture on disease risk. On average, people had an admixture proportion of 0.33 (+/- 0.12) 
amongst the 6 super populations we determined with the more granular division of Southern and 
Northern European, Eastern and Western African, and East and South Asian. Within our cohort, 
those who self-identified as non-Hispanic Black had on average 78.6% African ancestry (EAFR + 
WAFR), 19.4% European ancestry (SEUR + NEUR), and 1.99% Asian ancestry (EAS + SAS). 
Those who self-identified as non-Hispanic White had on average 6.85% African ancestry, 98.0% 
European ancestry, and 1.26% Asian ancestry. Our study was limited in its ability to test more 
admixed populations where these methods may be more useful in identifying phenotypes associated 
with genetic ancestry. 

We only used one ancestry as a predictor variable per model. Different geographic ancestries 
may interact differently, and this study does not account for various combinations of genetic 
ancestry proportions. Further investigation is needed to understand how the different genetic 
ancestries interact with each other and modify risk. A potential limitation of our study is the way in 
which some phenotypes may be diagnosed. Some phenotypes, such as chronic kidney disease, rely 
on algorithms that use self-reported race as a criterion to determine diagnosis. For example, estimated 
glomerular filtration rate (eGFR) algorithms have historically used race as a coefficient in the 
equation for measuring eGFR levels, which may bias diagnoses across racial and ethnic groups.³⁶ 

In this study, we identified hundreds of traits in the clinical phenome that are associated with 
ancestry proportion. From our selected studies of enriched phecode categories and modeling of HTN 
and AFib, we observed many relationships between ancestry and phecodes that matched the 
epidemiology literature between self-identified race and traits. We used RCS to model a significant 
relationship between RF and EAS ancestry, one that was not originally identified from linear 
modeling. We highlighted a few phenotypes in this paper as an exploratory investigation into the 
potential of RCS modeling for ancestry proportion and disease risk. 

Most traditional epidemiology literature notes that the shortcomings of its studies revolve around 
using the societal construct of race, a lack of healthcare access for underrepresented groups and 
low-income individuals, and external environmental factors. Adjusting for race to better account for 
factors like socioeconomic status or systemic discrimination, in addition to using genetic 
ancestry proportion, which captures heritable contributions, may provide more comprehensive 
models. Future work controlling for genetic ancestry that demonstrates significant associations with 
race would highlight systemic factors affecting outcomes that are not captured by ancestry alone. In 
addition to utilizing genetic ancestry, we show how alternative modeling methods can be useful 
especially in a case of an underrepresented ancestry group where linear models may not be as 
successful to describe more complicated associations. Our study displays how genetic ancestry can 
be leveraged in furtherance of studying disease risk where traditional epidemiological studies have 
fallen short. 
Precision medicine models often perform better for populations of European ancestry due 
to the over-representation of this group in the genomic datasets and large-scale biobanks 
from which the models are constructed. As a result, prediction models may misrepresent or 
provide less accurate treatment recommendations for underrepresented populations, con- 
tributing to health disparities. This study introduces an adaptable machine learning toolkit 
that integrates multiple existing methodologies and novel techniques to enhance the predic- 
tion accuracy for underrepresented populations in genomic datasets. By leveraging machine 
learning techniques, including gradient boosting and automated methods, coupled with 
novel population-conditional re-sampling techniques, our method significantly improves the 
phenotypic prediction from single nucleotide polymorphism (SNP) data for diverse popu- 
lations. We evaluate our approach using the UK Biobank, which is composed primarily of 
British individuals with European ancestry, and a minority representation of groups with 
Asian and African ancestry. Performance metrics demonstrate substantial improvements in 
phenotype prediction for underrepresented groups, achieving prediction accuracy compa- 
rable to that of the majority group. This approach represents a significant step towards 
improving prediction accuracy amidst current dataset diversity challenges. By integrating 
a tailored pipeline, our approach fosters more equitable validity and utility of statistical 
genetics methods, paving the way for more inclusive models and outcomes. 


Keywords: Genetics; Precision Medicine; Machine Learning; Phenotype Prediction; Bioin- 
formatics. 


1. Introduction 


In recent years, genome-wide association studies (GWAS) have provided many insights into 
the genetic basis of complex traits and diseases. However, these findings predominantly ben- 
efit populations of European descent due to their over-representation in genomic datasets. 
Individuals with Asian, African, and other ancestries only represent a small fraction of the 
available datasets.! Although individuals of European descent constitute ~79% of GWAS par- 
ticipants,” they account for less than a quarter of the global population. This disproportionate 
representation creates a limitation in precision medicine, because statistical models built to 
infer disease risks or health-related traits can perform poorly for individuals from populations 


© 2023 The Authors. Open Access chapter published by World Scientific Publishing Company and 
distributed under the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC) 
4.0 License. 


that were underrepresented when creating the model, exacerbating health disparities. Despite 
initiatives to include a broader range of populations in genetic studies and biobanks,*” the 
proportion of non-European individuals in GWAS studies has stagnated in the last decade.?* 
This imbalance has a direct impact on Polygenic Risk Score (PRS) prediction for underrep- 
resented populations,? making clinical applications based on PRS significantly more accurate 
for individuals of European descent, but less effective for other populations.'!°!? This dis- 
parity has raised ethical concerns within the scientific and clinical community.!?*13 While 
most studies only use European individuals and European-derived statistics to build predic- 
tive models,®11-14 recent studies have explored including non-European training data in PRS 
construction, but this has only proven effective when a large number of training samples of 
non-European target populations are available.'° 

Phenotype prediction utilizes genetic information to forecast an organism’s observable 
characteristics, known as phenotypes. These traits can range from disease susceptibility to 
other attributes, enabling personalized treatments based on individual genetic profiles. Machine 
learning (ML) and deep learning (DL) models used to predict phenotype and population 
structure from genomic data¹⁴,¹⁶⁻²⁰ are similarly negatively impacted by imbalanced datasets. 
Vokinger et al.²¹ highlighted the presence of bias in ML-based medicine prediction pipelines. 
Specifically, they revealed how a naive application of simple ML methods can showcase an 
overall good performance, yet still produce biased predictions favoring the majority population 
at the cost of lower accuracy for underrepresented groups. Efforts to mitigate this bias exist, 
such as Afrose et al.,²² who created a double prioritized bias correction technique that involves 
training customized prediction models for specific subpopulations. However, this approach is 
limited to binary classification tasks and is not generalizable to other prediction problems. 

Conventionally, the statistical methods that are applied for genomic prediction problems 
linearly combine the effects of different genetic variants on an individual’s risk of disease. Some 
of the most widely used regression models include Lasso,²³ a linear method with ℓ₁ penalty, 
Elastic net²⁴ with ℓ₁ and ℓ₂ penalty, and efficient implementations of both.¹⁴ Although they are 
the routine choice in most studies, linear models are not able to capture non-linear genetic 
interactions that can contribute to a phenotype.²⁵ The ability of non-linear predictive models 
to capture genetic interactions could help improve performance generalization across 
populations.²⁶,²⁷ Neural networks, a complex non-linear method, have recently gained traction in 
computational biology,²⁸,²⁹ but require vast amounts of data for training. Large-scale biobanks, 
such as the UK Biobank,³⁰ provide such expansive datasets. However, the small proportion 
of samples from minority populations hinders robust generalization across different genetic 
backgrounds. In contrast, gradient boosting (GB) algorithms,³¹ such as eXtreme Gradient 
Boosting (XGBoost)³² and LightGBM,³³ have frequently demonstrated superior performance 
for tabular data and small-sized datasets,³⁴,³⁵ and have already been explored in biological 
studies for tasks such as local ancestry inference, protein-protein interactions,³⁷ and drug-gene 
interactions.³⁸ In the realm of genotype-to-phenotype prediction, recent research has also 
highlighted the potential benefits of using such nonlinear predictive models.³⁹,⁴⁰ 

In this paper, we aim to improve phenotype prediction for diverse and underrepresented 
populations. We propose a more inclusive genomic research approach that uses multi-ancestry 
data together with advanced machine learning techniques to boost the predictability of complex 
traits across a broader range of populations. Our method leverages several machine 
learning techniques such as boosting and ensembling, and we propose population-conditional 
weighting and re-sampling techniques to generate more accurate models for underrepresented 
populations without requiring large sample sizes of non-European training data. Fig. 1 illus- 
trates the workflow of our approach, starting with the formation of the data set through the 
application of various machine learning techniques and data de-biasing methods. We compare 
our approach with state-of-the-art statistical genetics models on the UK Biobank, conduct- 
ing a systematic evaluation across 12 phenotypes in European (British), African, East Asian, 
and South Asian individuals. Given that the majority population is of European descent, we 
observe a large gap in phenotype prediction accuracy for minority populations when using 
classical linear methods. This disparity only grows when European-only data is used to train 
any of the prediction models. We demonstrate how the application of our method helps nar- 
row this accuracy gap, balance the performance across populations, and obtain state-of-the-art 
phenotype prediction results for multi-ancestry datasets. 


[Figure 1: pipeline diagram. Stages: Data processing, Data selection, Model Training, De-biasing.
Diagram labels: SNP Data, ADMIXTURE ancestry inference, Quality control, SNP Selection, Frequency
filtering, European-only Training Data, Multi-Ethnic Training Data, Simple ML models, Boosting,
Population-conditional weighting, Population-conditional undersampling.]

Fig. 1. A schematic representation of our predictive modeling pipeline, starting from the initial
data ingestion to the application of various ML methods and de-biasing techniques.


2. Methods 
2.1. Dataset preparation 


We utilize a dataset extracted from the UK Biobank®? that includes European (British), South 
Asian, African, and East Asian individuals (see Fig. 2). We use the pre-computed population 
labels from the Global Biobank Engine (GBE),* inferred based on genetic clustering with 
ADMIXTURE software” results, which provides a maximum likelihood estimation of an individual’s
genetic ancestry clustering from multilocus genotype datasets.

Fig. 2. EUR - European British, SAS - South Asian, AFR - African, EAS - East Asian.
(Left) Sample counts per group in the training and testing set. (Right) Percentage of SNP overlap
between the selected sets of SNPs per group using the MAF filter.

Single nucleotide polymorphism (SNP) sequences are encoded using a ternary system,
where at each genomic position $j$, an individual $i$ has $n_{ij} \in \{0, 1, 2\}$ copies of the
minor allele. To address high dimensionality and retain the most informative SNPs, we apply a SNP
selection process. Minor allele frequency (MAF) filtering is applied with a 1.25% threshold,
keeping a set $S_p$ of 10000 SNPs for each population $p \in P$, such that $|S_p| = 10000$. After
SNP selection for each population, we computed the union of these sets. It is important to note
that not all sets necessarily overlap with every other set. The union is represented by:

$$S_{\mathrm{union}} = \bigcup_{p \in P} S_p \qquad (1)$$

This resulted in a unified set of SNPs where $|S_{\mathrm{union}}| = 31153$, which is then used
for all individuals, creating a dataset of 66032 individuals and 31153 features. Fig. 2 shows the
intersection size of the sets of selected SNPs $S_p$ for all intersections of populations $p \in P$.
We observe that the highest overlap is between South Asian, East Asian, and European 
populations, while the selected set of SNPs for the African population has practically no 
overlap with the others. Any subsequent missing SNPs within the samples underwent mode 
imputation to ensure data completeness. 
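
The selection steps above can be sketched in a few lines of plain Python. This is a minimal
illustration, not the pipeline used in the paper: the function names are hypothetical, missing
calls are represented as `None`, and ranking the SNPs that pass the filter by MAF to keep at most
`k` of them is an assumption, since the text does not state how the 10000 SNPs per population are
chosen.

```python
from collections import Counter

def minor_allele_freq(column):
    """MAF of one SNP from 0/1/2 allele counts; None marks a missing call."""
    calls = [g for g in column if g is not None]
    if not calls:
        return 0.0
    freq = sum(calls) / (2 * len(calls))  # frequency of the counted allele
    return min(freq, 1.0 - freq)

def select_snps(genotypes, maf_threshold=0.0125, k=10000):
    """Indices of up to k SNPs passing the MAF filter (ranking by MAF is an assumption)."""
    n_snps = len(genotypes[0])
    mafs = [minor_allele_freq([row[j] for row in genotypes]) for j in range(n_snps)]
    passing = [j for j in range(n_snps) if mafs[j] >= maf_threshold]
    passing.sort(key=lambda j: mafs[j], reverse=True)
    return set(passing[:k])

def impute_mode(genotypes):
    """Replace missing calls with the most frequent value of the same SNP."""
    n_snps = len(genotypes[0])
    modes = []
    for j in range(n_snps):
        calls = [row[j] for row in genotypes if row[j] is not None]
        modes.append(Counter(calls).most_common(1)[0][0] if calls else 0)
    return [[modes[j] if g is None else g for j, g in enumerate(row)] for row in genotypes]

# Toy data: two populations, three individuals each, four SNPs.
by_population = {
    "EUR": [[0, 1, 0, 2], [0, 1, 0, 2], [0, 2, 0, None]],
    "AFR": [[1, 0, 0, 0], [2, 0, 0, 0], [1, 0, 0, 0]],
}
selected = {p: select_snps(g, k=2) for p, g in by_population.items()}
s_union = sorted(set().union(*selected.values()))  # union over populations, as in Eq. (1)
```

In this toy case each population contributes a different SNP, so the union is larger than either
per-population set, which mirrors the limited overlap reported in Fig. 2.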

To conduct our experiments, we study a set of phenotypes included in the GBE,"! listed 
in Table 1. Details regarding the correspondence of the GBE to the UK Biobank can be 
found in the GBE paper. We selected the available phenotypes with minimal missing data for 
the minority populations, and that also showed good predictive performance from genotype 
features.*? We analyze both binary phenotypes (absence or presence of the phenotype) and 
continuous phenotypes to evaluate model performance across both classification and regression 
tasks. 
Additionally, we filter out samples with missing phenotypic information so that no missing data
remains. The dataset is partitioned into a training set and a testing set, comprising 80% and 20%
of the data, respectively. We apply stratified sampling, ensuring that the proportion of samples
from each population closely mirrors its proportion in the overall dataset.
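
A population-stratified 80/20 split can be sketched as follows. The helper name and the fixed seed
are illustrative assumptions; the paper does not describe its implementation.

```python
import random
from collections import defaultdict

def stratified_split(population_labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so that each population is split at the same ratio."""
    rng = random.Random(seed)
    by_pop = defaultdict(list)
    for i, pop in enumerate(population_labels):
        by_pop[pop].append(i)
    train_idx, test_idx = [], []
    for idx in by_pop.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: 80 European, 10 South Asian, 10 African individuals.
labels = ["EUR"] * 80 + ["SAS"] * 10 + ["AFR"] * 10
train_idx, test_idx = stratified_split(labels)
```

Splitting within each population separately guarantees that small groups appear in the test set
in proportion to their size instead of being left out by chance.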


Table 1. We present results on 12 phenotypes: 10 continuous and 2 binary.

Variable              Type         Variable                            Type
Standing height       Continuous   Weight                              Continuous
Ankle spacing width   Continuous   Impedance of whole body             Continuous
HDL cholesterol       Continuous   Apolipoprotein A                    Continuous
Urate                 Continuous   Total bilirubin                     Continuous
Plateletcrit          Continuous   Red blood cell (erythrocyte) count  Continuous
Diabetes              Binary       Atrial fibrillation                 Binary


2.2. Algorithmic models 


We explore a wide range of machine learning methods to improve phenotype prediction on un- 
derrepresented populations. Some algorithms serve as standalone models, capable of making 
predictions without supplementary techniques. Other algorithms we describe in this section, 
such as boosting, are techniques that can be used to further improve the performance of a base 
machine learning model. Finally, we explore complex machine learning systems that combine 
multiple models and automate the process of machine learning. 


Linear models We include the Least Absolute Shrinkage and Selection Operator (Lasso),
a linear regression method that performs variable selection (i.e., identifies the most important
predictors) and regularization, which prevents overfitting by constraining the model parameters.
It does this by imposing an $\ell_1$ penalty, effectively reducing some coefficients to zero.
We also use Elastic Net,” a regularized method that combines $\ell_1$ and $\ell_2$ penalties,
allowing coefficient shrinkage and feature selection.
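
For reference, the two penalized objectives can be written as below. The scaling of the loss and
the mixing parameter $\alpha$ follow one common parameterization and are an assumption; the paper
does not state which form it uses. Here $X$ is the genotype matrix, $y$ the phenotype vector, $N$
the number of training samples, and $\lambda \ge 0$ the regularization strength.

```latex
% Lasso: squared-error loss with an l1 penalty
\hat{\beta}_{\mathrm{lasso}}
  = \arg\min_{\beta} \; \frac{1}{2N} \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1

% Elastic Net: convex combination of l1 and l2 penalties, 0 <= alpha <= 1
\hat{\beta}_{\mathrm{EN}}
  = \arg\min_{\beta} \; \frac{1}{2N} \lVert y - X\beta \rVert_2^2
    + \lambda \left( \alpha \lVert \beta \rVert_1 + \frac{1-\alpha}{2} \lVert \beta \rVert_2^2 \right)
```

Setting $\alpha = 1$ recovers the Lasso, while $\alpha = 0$ gives ridge regression, which shrinks
coefficients without setting any of them exactly to zero.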


Boosting We consider boosting,“4 a powerful ensemble machine learning technique that constructs
a strong predictive model by combining multiple weak learners (simple models) that are trained
sequentially. In each iteration of the boosting process, a new weak learner is trained giving
more importance to the instances that were poorly predicted by the previous models, meaning the
model attempts to correct the errors of its predecessors. This procedure is repeated
sequentially, with each new model targeting the instances where the combined ensemble has
performed the worst. The final model is a weighted combination of all the weak models, which
often yields strong predictive performance by aggregating the strengths of all individual
models. Decision trees are the most common type of weak learner used in boosting algorithms.
However, we also study how boosting can help improve predictive performance when traditional
linear methods used in the field, such as Elastic Net, are used as weak learners.


Gradient boosting machines Gradient boosting machines (GBMs) are a specific implementation of
boosting. The key idea behind GBMs is the use of gradient descent to minimize a loss function,
which quantifies how well the model predicts the target variable. In each iteration, rather than
directly focusing on the poorly predicted instances, a new decision tree is fit to the negative
gradient of the loss function (the residuals, in the case of squared error) with respect to the
prediction of the ensemble model from the previous stage. This new decision tree provides a
direction in which the prediction should be adjusted to minimize the loss function. The
predictions are then updated by taking a step in this direction. Extreme Gradient Boosting
(XGBoost)*? and LightGBM*? are two optimized implementations of GBMs that have gained significant
popularity due to their efficiency and performance. XGBoost offers several advanced features,
such as regularized boosting, handling of missing values, and tree pruning, that make it faster
and more robust. LightGBM also offers high performance and efficiency but is particularly notable
for its effectiveness with large datasets and high-dimensional data, due to its histogram-based
algorithm that reduces memory usage and speeds up training.
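
The residual-fitting loop described above can be illustrated with a toy gradient boosting
regressor that uses one-split trees (stumps) on a single feature. This is a didactic sketch under
squared-error loss, not the XGBoost or LightGBM algorithm; the function names, the number of
rounds, and the learning rate are arbitrary choices made for the example.

```python
def fit_stump(x, residuals):
    """Best single-threshold split of a 1-D feature under squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def gradient_boost(x, y, n_rounds=50, learning_rate=0.1):
    """Fit stumps sequentially to the residuals of the current ensemble."""
    base = sum(y) / len(y)  # initial constant prediction
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # For the loss 1/2 (y - f)^2 the negative gradient is the residual y - f.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + learning_rate * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)

# Toy data: a non-additive effect of allele count (0, 1, 2) on a continuous trait.
x = [0, 0, 1, 1, 2, 2]
y = [1.0, 1.2, 1.1, 0.9, 3.0, 3.2]
model = gradient_boost(x, y)
mse_model = sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)
mse_base = sum((yi - sum(y) / len(y)) ** 2 for yi in y) / len(y)
```

The toy trait jumps only for two copies of the allele, a pattern that a single linear coefficient
cannot fit exactly but that a sum of stumps approximates closely.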


AutoML Automated Machine Learning (AutoML)* refers to the automated process of end- 
to-end model development, encompassing steps from feature engineering to model selection, 
hyperparameter tuning, and model evaluation. AutoML methods have been developed to 
streamline the machine learning pipeline while reducing time and expertise required to develop 
effective predictive models. In particular, we consider AutoGluon*® (AG), a state-of-the-art 
AutoML framework known for its robust performance, efficiency and ease of use. AutoGluon 
automatically trains and optimizes multiple models such as neural networks, nearest neighbors, 
linear models, and gradient boosting machines, combining them into a stacked ensemble. 


2.3. Population-conditional re-sampling solutions 


We introduce a set of population-conditional re-sampling techniques to address population 
imbalance in datasets. These techniques serve as auxiliary methods designed to reduce model 
bias towards the majority population and can be integrated with any predictive model. While 
we focus on human populations in this work, these techniques can also be applied to any data 
where samples can be grouped into different populations, groups, or categories. Moreover, 
they are suitable for tasks beyond single-target classification, such as regression, and they can 
also be extended to multi-output tasks. 


Population-conditional oversampling and undersampling We modify the traditional
oversampling and undersampling techniques used in imbalanced classification tasks, and adapt
them to address imbalances at the population level, regardless of the target variables (both
categorical and continuous). We organize the training dataset as $X \in \mathbb{R}^{N \times d}$
such that each row represents an individual, and the target variable or variables are
concatenated to the rest of the input features as the final attributes. The population label is
then used as a downstream label $y' \in \mathbb{R}^{N}$ for the oversampling or undersampling
rule, originally designed to work with single-target imbalanced classification datasets, such
that the “minority” samples are those pertaining to the populations with lowest representation
in the dataset. After this procedure, we discard the population labels and split the columns of
the re-sampled training dataset as features and targets and fit the prediction models.

We explore population-conditional random oversampling (OS) by picking samples at ran- 
dom with replacement from the minority populations. We also adapt the Synthetic Minority 
Over-sampling Technique (SMOTE),4” which is commonly used to address class imbalances by 
generating synthetic samples. Our modification enables us to synthetically increase the number 
of instances from the minority populations in the training set. Note that in the case of re- 
gression tasks, our approach differs from existing SMOTE variations for regression,**49 which 
involve identifying “minority” samples based on the distribution of the target values rather 
than external categorical labels associated with the samples. Finally, we also consider adapting 
the SMOTE-Edited Nearest Neighbours (SMOTE-ENN) algorithm,*° a method that combines 
both oversampling and undersampling techniques. Our proposed population-conditional vari- 
ation can also be applied to any other re-sampling technique originally designed to address 
class imbalance in classification problems. 
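
A minimal sketch of population-conditional random oversampling is given below. Each row is
assumed to hold the features followed by the target, as described above. Resampling every
population up to the size of the largest one is an assumption borrowed from the usual default of
class-based random oversampling; the paper does not specify the target sizes.

```python
import random
from collections import Counter, defaultdict

def pc_random_oversample(rows, populations, seed=0):
    """Resample each population with replacement up to the size of the largest one."""
    rng = random.Random(seed)
    by_pop = defaultdict(list)
    for row, pop in zip(rows, populations):
        by_pop[pop].append(row)
    target = max(len(group) for group in by_pop.values())
    out_rows, out_pops = [], []
    for pop, group in by_pop.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)
        out_pops.extend([pop] * target)
    return out_rows, out_pops

# Toy rows: (allele count, phenotype); the population label drives the re-sampling.
rows = [(g, 160.0 + g) for g in range(9)]
pops = ["EUR"] * 6 + ["SAS"] * 2 + ["EAS"]
new_rows, new_pops = pc_random_oversample(rows, pops)
counts = Counter(new_pops)  # every population now has the same number of rows
```

Because features and target travel together in each row, the same routine applies unchanged to
regression and classification targets; the population labels are dropped before model fitting.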


Population-conditional weighting In a similar fashion, traditional class-based sample
weighting techniques for class imbalance give more importance to underrepresented classes
in the target variable. In contrast, we propose to emphasize the individual instances from
underrepresented populations according to the population label assigned to each sample,
regardless of their target variable. We calculate $N_p$, the size (i.e., number of samples) of
each population $p \in P$ in the training set, and assign a weight $w_p = N / N_p$ to each sample
corresponding to population $p$, inversely proportional to the size of its population, where $N$
is the total size of the training dataset.
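
The weighting rule can be sketched as follows, assuming the weight is $N / N_p$, which is one
reading of "inversely proportional to the size of its population". The function name is
hypothetical.

```python
from collections import Counter

def population_weights(populations):
    """One weight per sample: total training size divided by the sample's population size."""
    counts = Counter(populations)
    n_total = len(populations)
    return [n_total / counts[pop] for pop in populations]

# Toy training set: 8 European and 2 South Asian samples.
pops = ["EUR"] * 8 + ["SAS"] * 2
weights = population_weights(pops)
```

With these weights every population contributes the same total weight to the loss (here
8 x 1.25 = 2 x 5 = 10). Many learning libraries accept such per-sample weights at fit time, for
example through a `sample_weight` argument.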


2.4. Evaluation setup 


For training, data is either filtered to contain only European-ancestry individuals, mirroring
the typical bias seen in many genetic studies, or comprises the complete multi-ethnic dataset
that includes individuals from underrepresented populations. The testing data is fixed and
contains samples from each population group, allowing model performance to be assessed and
compared across each population in all the experiments. Model hyperparameters are adjusted
by 5-fold cross validation, with hyperparameter configurations drawn from comprehensive 
search spaces until 1000 configurations are explored or a search budget of 120 hours is reached. 
Then, the model is fitted on the full training set with the chosen hyperparameter configuration, 
and evaluated on the held out test set (20% of the data). 

Predictive performance is evaluated using the coefficient of determination ($R^2$) for
regression tasks, and the Area Under the Receiver Operating Characteristic Curve (ROC AUC) for
classification tasks. $R^2$ represents the proportion of variance in the predicted phenotype that
is explained by the genotype, and its value lies between 0 and 1. An $R^2$ nearing 1 signifies the
model’s high accuracy in phenotype prediction using the given genetic data. In contrast, values
approaching 0 highlight the model’s limited predictive capability. ROC AUC measures the
model’s ability to distinguish between the positive and negative classes. The value ranges from
0 to 1, with 0.5 indicating performance equivalent to random chance, and values approaching
1 indicating high predictive accuracy.
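
Both metrics can be computed directly from their definitions, as in the sketch below. The
function names are illustrative. Note that $R^2$ computed this way on held-out data can fall
below 0 when predictions are worse than the mean; whether such values are reported as 0 in the
results is an assumption, not something the text states.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

perfect = r2_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])        # predictions match exactly
mean_only = r2_score([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])      # always predicting the mean
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])          # 3 of 4 pairs ranked correctly
```

Computing each metric separately on the test samples of each population yields per-group scores
of the kind compared in the results.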


3. Results 
3.1. Continuous phenotypes 


We first analyze the use of multi-ethnic data and the predictive performance of several linear
and non-linear models, including Lasso, Elastic Net, LightGBM, and XGBoost, for the 10
continuous phenotypes described in Table 1. Fig. 3 shows the increase in $R^2$ when training
the models with multi-ethnic data, compared to training a linear model (Lasso) with only European
individuals, which is the common practice in the field. Note that relative performance (ratio)
cannot be computed per population, as the baseline model obtains an $R^2$ of 0 for some
population groups when predicting some of the phenotypes. We observe that prediction performance
significantly improves across all populations and methods when including multi-ethnic data in
training. Specifically, the gradient boosting method LightGBM is the model that obtains the
highest boost in predictive performance consistently across all ancestry backgrounds, including
European and underrepresented ones.

Fig. 3. Aggregated results of increase in $R^2$ for the 10 continuous phenotypes, with a 95%
confidence interval, comparing the scores for models trained on multi-ethnic data (including
populations underrepresented in the UK Biobank) versus models trained exclusively on the
British-with-European-ancestry population.


To gain deeper insight into how the various methodologies influence the prediction of a single
phenotype, we focus on the Standing Height phenotype. Fig. 4 shows our experiments on different
models and techniques, with the complexity of machine learning techniques increasing from
left to right. Our experiments begin with Elastic Net (EN), starting from a simple linear
model trained on individuals of European descent. We then include multi-ethnic data and
introduce population-conditional weighting during training. Subsequently, we explore creating
an ensemble of Elastic Nets using boosting. As a more complex boosting algorithm, we include
LightGBM, followed by AutoGluon, an AutoML method that trains multiple ML models to
form a stacked ensemble, including LightGBM as one of its members.

Fig. 4. Comparison of $R^2$ scores across diverse populations for the Standing Height phenotype.
“EN” represents Elastic Net. The population used for training is provided in parentheses, with
“EUR” signifying European-only training data, and “Multi-E” indicating the use of multi-ethnic
data. The symbol “W” marks the application of population-conditional sample weighting.


We note incremental performance gains for all populations, starting from Elastic Net, which
yields an $R^2$ of 0 for South Asian and East Asian individuals when trained solely on European
data. Introducing multi-ethnic data leads to significant $R^2$ improvements, narrowing
the performance gap between populations. Moreover, population-conditional weighting boosts
performance for underrepresented groups. Finally, non-linear methods like LightGBM and
AutoGluon have proven especially effective for the European, South Asian and East Asian
populations. Gains are more modest for the African samples due to the higher genetic variation
within this group, making phenotype prediction a more challenging task. Models trained on
multi-ethnic datasets can still struggle to capture the intricate relationships between genotype
and phenotype specific to African populations. As we integrated increasingly complex models and
de-biasing techniques, we observed an overall improvement in $R^2$, underscoring that non-linear
models, multi-ethnic data, and de-biasing techniques collectively drive enhanced results.

Table 2. $R^2$ results for standing height. All the proposed population-conditional (PC)
re-sampling methods use multi-ethnic training data. EN: Elastic Net, AG: AutoGluon.

Population   Training       Lasso  EN     Boosted EN  LightGBM  XGBoost  AG
European     European-only  0.508  0.477  0.506       0.520     0.520    0.520
             Multi-ethnic   0.497  0.473  0.503       0.517     0.517    0.519
             PC-Random OS   0.451  0.435  0.492       0.503     0.501    0.510
             PC-SMOTE       0.465  0.422  0.499       0.513     0.505    0.508
             PC-SMOTE-ENN   0.189  0      0.132       0.319     0.388    0.372
             PC-Weighted    0.452  0.435  0.496       0.506     0.501    0.513
South Asian  European-only  0      0      0           0         0        0
             Multi-ethnic   0.342  0.452  0.460       0.554     0.547    0.554
             PC-Random OS   0.506  0.499  0.523       0.542     0.549    0.557
             PC-SMOTE       0.486  0.480  0.509       0.552     0.533    0.541
             PC-SMOTE-ENN   0.520  0.467  0.525       0.544     0.548    0.553
             PC-Weighted    0.506  0.498  0.523       0.537     0.543    0.563
African      European-only  0.374  0.368  0.372       0.373     0.355    0.351
             Multi-ethnic   0.442  0.427  0.440       0.441     0.437    0.443
             PC-Random OS   0.442  0.426  0.439       0.386     0.429    0.439
             PC-SMOTE       0.434  0.411  0.421       0.400     0.433    0.414
             PC-SMOTE-ENN   0.443  0.397  0.427       0.401     0.431    0.418
             PC-Weighted    0.442  0.426  0.437       0.406     0.423    0.442
East Asian   European-only  0      0      0           0         0        0
             Multi-ethnic   0.174  0.426  0.413       0.535     0.513    0.534
             PC-Random OS   0.487  0.500  0.511       0.536     0.540    0.548
             PC-SMOTE       0.466  0.479  0.497       0.525     0.526    0.547
             PC-SMOTE-ENN   0.490  0.479  0.502       0.534     0.533    0.547
             PC-Weighted    0.487  0.500  0.513       0.511     0.524    0.552

Table 2 provides a comprehensive comparison of various models in predicting standing
height across different ancestry groups using diverse training techniques. For the individuals of
European descent, training with either European-only or multi-ethnic data yielded similar
results, with LightGBM, XGBoost, and AutoGluon emerging as top performers. In contrast,
for the South Asian and East Asian groups, introducing multi-ethnic data and applying the
proposed population-conditional re-sampling significantly improves predictive performance.
The best results in the Asian groups are obtained by applying population-conditional sample
weighting with AutoGluon. For the African group, top performance was observed not only
with AutoGluon trained on multi-ethnic data but also with the Lasso combined with the
population-conditional SMOTE-ENN. This finding underscores the importance of not only
model choice but also nuanced training strategies, especially for diverse groups.


3.2. Binary phenotypes 


We extend our experiments to classification models to observe if they follow similar trends as 
the regression results presented above. Table 3 showcases the ROC AUC results for two bi- 
nary phenotypes (diabetes and atrial fibrillation). For both phenotypes, AutoGluon frequently 
achieves the highest ROC AUC scores, followed by LightGBM, outperforming the linear mod- 
els. Particularly, the population-conditional weighted training improves model outcomes for 
the underrepresented groups when using multi-ethnic data. 


Table 3. Performance of various models and training techniques in predicting binary phenotypes
(Diabetes and Atrial Fibrillation), as measured by ROC AUC scores per group. The
proposed population-conditional (PC) method uses multi-ethnic training data.

Phenotype            Population   Training       Lasso  Elastic Net  LightGBM  AutoGluon
Diabetes             European     European-only  0.520  0.530        0.585     0.604
                                  Multi-ethnic   0.501  0.508        0.606     0.616
                                  PC-Weighted    0.494  0.495        0.562     0.610
                     South Asian  European-only  0.546  0.547        0.550     0.570
                                  Multi-ethnic   0.535  0.535        0.539     0.562
                                  PC-Weighted    0.528  0.533        0.563     0.586
                     African      European-only  0.508  0.527        0.483     0.509
                                  Multi-ethnic   0.527  0.533        0.507     0.494
                                  PC-Weighted    0.516  0.516        0.543     0.493
                     East Asian   European-only  0.391  0.409        0.480     0.552
                                  Multi-ethnic   0.421  0.385        0.554     0.579
                                  PC-Weighted    0.400  0.429        0.638     0.558
Atrial fibrillation  European     European-only  0.537  0.537        0.591     0.625
                                  Multi-ethnic   0.538  0.539        0.594     0.624
                                  PC-Weighted    0.538  0.537        0.609     0.629
                     South Asian  European-only  0.504  0.485        0.562     0.513
                                  Multi-ethnic   0.478  0.498        0.547     0.548
                                  PC-Weighted    0.479  0.501        0.487     0.586
                     African      European-only  0.544  0.521        0.559     0.665
                                  Multi-ethnic   0.554  0.532        0.523     0.592
                                  PC-Weighted    0.550  0.509        0.499     0.566
                     East Asian   European-only  0.350  0.424        0.596     0.405
                                  Multi-ethnic   0.313  0.397        0.507     0.459
                                  PC-Weighted    0.322  0.424        0.542     0.374

4. Conclusions and Future Work


Our results advocate for the implementation of non-linear and ensemble methods, particularly 
LightGBM and AutoGluon, combined with the proposed population-conditional techniques to 
enhance genotype-to-phenotype prediction tasks for populations underrepresented in existing 
datasets. Strategies such as boosting and population-conditional sample weighting and
re-sampling proved to be valuable additions for generalizing better across populations and
improving prediction accuracy. These methods were effective for both continuous and binary
phenotypes, demonstrating their applicability for both regression and classification models. 

Our study illustrates the use of methodological advancements to enhance prediction accuracy
in the face of a lack of diverse genetic datasets. While the ideal solution would simply be
the inclusion of more representative datasets, this is not yet realistic given the current
data landscape. As such, we recommend that our models and techniques be applied
when researchers are dealing with datasets of biased representation, especially in genetics.
Using these methods should be a priority in situations demanding equitable outcomes, such
as in clinical studies.

Failure to address these disparities could engender biases in precision medicine, which 
might unfavorably impact underrepresented populations. While our study addressed twelve 
phenotypes, expanding this focus to include other disease phenotypes in future research could 
yield a deeper understanding of genetic influences on disease. Although AutoGluon includes 
simple neural network models, future work could delve into a broader spectrum of deep learning 
architectures, including convolutional layers and attention mechanisms. 

The moderate improvement in the African population compared to the Asian groups when 
applying multi-ethnic training and population-conditional re-sampling can be attributed to 
the inherent genetic diversity present within the African group, as the SNPs selected for this 
study are predominantly enriched for representation in Eurasian populations. For future work, 
a more refined SNP selection tailored for more diverse ancestral backgrounds could poten- 
tially enhance the predictive performance and rectify this limitation. A deeper investigation 
into linkage disequilibrium among SNPs could also optimize the SNP selection process by 
minimizing redundancies. Although models studied are able to capture genotype-phenotype 
relationships, covariates, particularly genetic principal components, could allow for a more ac- 
curate accounting of the underlying population structure. Incorporating advanced explainable 
ML techniques”! alongside further analysis of covariates can elucidate the underlying mecha- 
nisms through which non-linear relationships boost predictive performance, offering a clearer 
insight into genotype-phenotype mappings. These approaches could refine model performance 
and enhance prediction accuracy across different ancestry backgrounds. 

Given the prevalent bias in many clinical and genetic datasets,'° underrepresented
populations are often overlooked, with potentially grave implications for health outcomes. This
issue is especially pertinent in an era where precision health methods and AI algorithms are
becoming increasingly prominent. Thus, implementing strategies such as those presented in
our study could considerably enhance the equity and effectiveness of precision medicine
for underrepresented groups.

This study quantifies health outcome disparities in invasive Methicillin-Resistant Staphylococcus
aureus (MRSA) infections by leveraging a novel artificial intelligence (AI) fairness algorithm, the 
Fairness-Aware Causal paThs (FACTS) decomposition, and applying it to real-world electronic 
health record (EHR) data. We spatiotemporally linked 9 years of EHRs from a large healthcare 
provider in Florida, USA, with contextual social determinants of health (SDoH). We first created a 
causal structure graph connecting SDoH with individual clinical measurements before/upon 
diagnosis of invasive MRSA infection, treatments, side effects, and outcomes; then, we applied 
FACTS to quantify potential outcome disparities along different causal pathways including SDoH,
clinical, and demographic variables. We found moderate disparity with respect to demographics and
SDoH, and all the top-ranked pathways that led to outcome disparities in age, gender, race, and
income included comorbidity. Prior kidney impairment, vancomycin use, and timing were
associated with racial disparity, while income, rurality, and available healthcare facilities contributed 
to gender disparity. From an intervention standpoint, our results highlight the necessity of devising 
policies that consider both clinical factors and SDoH. In conclusion, this work demonstrates a 
practical utility of fairness AI methods in public health settings. 


Keywords: AI fairness; Methicillin-resistant Staphylococcus aureus; Health outcome disparity 


1. Introduction 


Invasive Methicillin-Resistant Staphylococcus aureus (MRSA) infections pose a significant public 
health concern. According to the Centers for Disease Control and Prevention (CDC), MRSA 
infections account for a substantial proportion of healthcare-associated infections, affecting both 
inpatient and outpatient settings'. These infections, characterized by resistance to all beta-lactam 
antibiotics, have been associated with increased morbidity, mortality, and healthcare costs. 

It is widely recognized that socioeconomic and demographic factors influence transmission and
care outcomes of infectious diseases, including MRSA. For example, See et al. (2017) shed light on
the complex interplay between race, socioeconomic factors, and MRSA infections”. Gualandi et al.
(2018) analyzed surveillance data from nine US states (20+ million people) and found
that the risk of MRSA infection in African Americans was double the risk in other racial/ethnic
groups, even when rates were decreasing*. Nonetheless, Mohnasky et al. (2021) found that in a
prospective cohort of individuals seen in a single, large US medical center for over 20 years,
social disparity in MRSA outcomes was explained by differences in comorbidities between
racial/ethnic groups. Thus, contrasts among studies could be explained by population selection
and modeling choices. Many studies on quantification of health outcome disparity within invasive
MRSA infections have been associational in nature, and further research is necessary to
deconstruct and understand the underlying causal mechanisms driving such disparity in order to
identify potential avenues for intervention. Such advancement can help develop effective
strategies to mitigate the impact of the disease which target not only the majority of the
population, but also specific subpopulations that might be more vulnerable, e.g., the elderly or
ethnic/racial minorities.

† Work partially supported by grant NIH NIAID 1R01AI141810; NIH NIA R33AG062884;
© 2023 The Authors. Open Access chapter published by World Scientific Publishing Company and distributed under
the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC) 4.0 License.

To address the aforementioned challenges, we employ a recently developed artificial 
intelligence (AI) fairness algorithm, the Fairness-Aware Causal paThs (FACTS) decomposition’. 
FACTS is able to decompose disparity of an outcome measure with respect to a variable of interest 
into multiple causal pathways, and to quantify the relative contribution of each path. We apply
FACTS to large real-world electronic health record (EHR) data collated over 9 years from a large
healthcare provider in Florida, USA, linked with contextual social determinants of health (SDoH).


2. Materials and Methods 


2.1. Ethics Statement 


This study obtained approval from the Institutional Review Board (#IRB201900652) of the 
University of Florida (UF). The authors strictly adhere to the research integrity and ethical principles 
outlined in the Declaration of Helsinki. 


2.2. Data Source 


We analyzed deidentified EHR data from the UF Health’s Integrated Data Repository (IDR, 
https://idr.ufhealth.org/), which includes two primary hospitals in Gainesville and Jacksonville, and 
several other outpatient clinics in Florida. The IDR-EHR data includes patients’ demographics, 
residence (here masked into 3-digit zip codes), laboratory tests (encoded with Logical Observation 
Identifiers Names and Codes, LOINC), drug prescriptions (RxNorm terminology), clinical 
procedures and diagnoses (International Classification of Diseases, ICD 9th and 10th revisions). In this
study all ICD-10 codes were converted to ICD-9 format following General Equivalence Mappings 
guideline of Centers for Medicare & Medicaid Services® since the sample predominantly consisted 
of ICD-9 codes. Data requests can be directed to IDR (https://idr.ufhealth.org/research-services/) in 
compliance with institutional, state and US Federal policies; authors are willing to share study 
procedures for reproducing results. 

We linked individual patient records to the county-level social determinants of health (SDoH)
variables using multiple external sources. SDoH variables used in this work were: Median
Household Income’; Rurality (urban or rural based on the Federal Bureau of Investigation
metropolitan criteria)*; Health Insurance Coverage (proportion of uninsured residents)’; and
Access to Healthcare Facilities (proximity and availability of healthcare facilities
such as hospitals, clinics, and primary care providers in the area; number of hospital beds per
100,000 population)!°.

[Figure 1: flowchart]
Patients diagnosed as Invasive MRSA between 2011 and 2019 (n=1,433),
based on culture test on blood, fluid, bone, and other internal body site
-> Adult patients who had at least one medical record
   at least one year ahead of their first MRSA diagnosis (n=914)
-> Patients who had admission and discharge record with the culture test (n=872)
-> Having medication records from Time 1 to Time 3 (n=427)
-> Patients with matched socio-demographic information (n=395)
-> Study Population (n=395)
   Death within 30-days from the onset (n=30, 7.6%)
   No Death within 30-days from the onset (n=365, 92.4%)

Fig. 1. Flowchart of Study Population


2.3. Study Design, Study Population, Variables 


We included adults aged 18 years and older at the time of diagnosis of invasive MRSA (ICD-9-CM: 
041.11) at UF Health between January 1, 2011, and July 1, 2019. To ensure that comprehensive medical
information was available, patients without complete sociodemographic information or without a
prior medical record from at least one year before their first invasive MRSA diagnosis were
excluded. Excluding these patients mitigated potential bias arising from incomplete past
medical information. To follow antibiotic treatment over time, we defined three time points
for each patient’s antibiotic treatment. Time 1 was defined as the empiric treatment stage, in which the
patient receives antibiotics at the onset of infection, before any test results are confirmed. Time 2 was defined
as the time when the initial antibiotic susceptibility testing results became available, and Time 3 was 7 days
after Time 2 (reflecting the patient’s latest clinical progression). A detailed clinical justification of the
choice of the three time points is given in a prior work.¹¹ Individuals who were missing antibiotic
treatment history for any of those three time points were dropped from the study. Fig. 1 provides an
overview of the inclusion criteria cascade. 
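
The three time points can be illustrated with a short Python sketch. This is not the study's code: the function and field names are hypothetical, and anchoring Time 1 to the index date is an assumption made only for illustration. The text above fixes only that Time 2 is the date of the initial susceptibility result and that Time 3 falls 7 days after Time 2.

```python
from datetime import date, timedelta
from typing import Dict, NamedTuple, Optional


class TimePoints(NamedTuple):
    time1: date  # empiric treatment stage (assumed here to start at the index date)
    time2: date  # initial antibiotic susceptibility result available
    time3: date  # 7 days after Time 2


def define_time_points(index_date: date, susceptibility_date: date) -> TimePoints:
    """Derive the three treatment time points for one patient."""
    return TimePoints(
        time1=index_date,
        time2=susceptibility_date,
        time3=susceptibility_date + timedelta(days=7),
    )


def has_complete_history(vancomycin_by_time: Dict[str, Optional[bool]]) -> bool:
    """True only when a treatment record exists at all three time points.

    A value of None (or a missing key) stands for a missing record; such
    patients would be dropped from the cohort.
    """
    return all(
        vancomycin_by_time.get(key) is not None
        for key in ("time1", "time2", "time3")
    )


# Hypothetical patient: diagnosed on March 1, susceptibility result on March 4.
points = define_time_points(date(2015, 3, 1), date(2015, 3, 4))
keep = has_complete_history({"time1": True, "time2": True, "time3": False})
drop = has_complete_history({"time1": True, "time2": None, "time3": False})
```

A recorded value of False (no vancomycin at that time point) still counts as a complete record; only a missing value leads to exclusion.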

The study’s index/baseline date was set corresponding to the first invasive MRSA diagnosis, 
and the outcome was 30-day mortality. The patients’ variables at index date included age, gender 
(male vs. female), race (African American vs. White), Charlson’s comorbidity index (CCI), history
of antibiotic usage, prior history of kidney impairment, types of infection (i.e., bloodstream infection 
or not), severity of infection (i.e., transfer to intensive care unit, ICU, or not), and the SDoH panel. 
Additional variables upon admission included the treatment course (i.e., whether they received 
vancomycin or not at each time point), and side effects (i.e., nephrotoxicity developed after the 
initial treatment). 


2.4. Causal Assumptions and FACTS 


Using literature search and authors’ consensus, we created a partially directed acyclic graph (pDAG) 
connecting SDoH with individual clinical measurements before/upon diagnosis of invasive MRSA 
infection, treatments, side effects, and outcomes. Double-headed arrows may represent unmeasured
confounding between two variables (e.g., income and rurality). 

Each arrow in the pDAG is supported by at least one finding from our literature search. Race 
was associated with previous vancomycin use!”!’, types of infection", severity of infection", prior 
kidney impairment", prior drug resistance*!®, income”, and chronic comorbidities!**>. Income and 
health insurance were linked’+”°. Sex was associated with health insurance”°, income”, and chronic 
comorbidities” °°. Income and rurality were linked*!. Rurality was also associated with access to 
healthcare facilities? , and healthcare facilities were associated with chronic comorbidities**® . 
Age was associated with health insurance coverage?” and chronic comorbidities”!**?’, Previous 
vancomycin use was associated with prior drug resistance** and vancomycin at Time 1°”. Type
of infection", severity of infection?! , chronic comorbidities? , and prior drug resistance?’ were
also associated with vancomycin at Time 1. Vancomycin at Time 1 was associated with vancomycin at
Time 2, and vancomycin at Time 2 was associated with vancomycin at Time 3. Chronic
comorbidities and prior kidney impairment were associated with nephrotoxicity which developed
after Time 1*°. Vancomycin at Time 2 and Vancomycin at Time 3 were also linked with
nephrotoxicity which developed after Time 1***°. Vancomycin at Time 3 was associated with
mortality*®.


[Figure 2 nodes: Race (exposure); Income; Insurance; Rurality; Healthcare Facilities; Chronic
Comorbidity; Previous vancomycin use; Prior kidney impairment; Prior Drug Resistance; Types of
Infection; Severity of Infection; Vancomycin at T1, T2, and T3; Nephrotoxicity developed after T1;
Mortality]

Fig. 2. Partially directed acyclic graph representing the causal relationships among clinical,
sociodemographic variables, and MRSA 30-day mortality (race is displayed as the exposure
variable).

The final pDAG is provided in Fig. 2. We selected race, income, gender, and age as exposure 
variables. The pDAG was used to calculate an adjustment set to identify the effect of the exposures 
with respect to MRSA outcome, quantifying the potential disparity in terms of odds ratios using a 
main-effects logistic regression. After this analysis, we applied FACTS to our pDAG using the
same exposures’. In detail, FACTS builds a prediction model of the outcome using all variables
(through the XGBoost algorithm), then uses a given pDAG and a ‘sensitive’ attribute of interest 
(i.e., exposure, like gender or race) to calculate the contribution to outcome disparity for all paths 
involving such sensitive attribute. Finally, it ranks and outputs the most important paths. 
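
As a worked illustration of the odds ratio scale used here, consider the simplest case of a single binary exposure with an empty adjustment set. In that case the odds ratio from a logistic regression equals the cross-product ratio of the 2x2 table, and the Wald interval on the log-odds scale matches Woolf's formula. The sketch below assumes this unadjusted case only; the counts are hypothetical and are not taken from the study, and adjusted (direct-effect) estimates would require fitting the full regression model.

```python
import math
from typing import Tuple


def odds_ratio_wald(a: int, b: int, c: int, d: int, z: float = 1.96) -> Tuple[float, float, float]:
    """Unadjusted odds ratio with a Wald (Woolf) confidence interval.

    a: exposed, outcome present      b: exposed, outcome absent
    c: unexposed, outcome present    d: unexposed, outcome absent
    All four cells must be non-zero.
    """
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)
    return (
        math.exp(log_or),
        math.exp(log_or - z * se),
        math.exp(log_or + z * se),
    )


# Hypothetical counts, for illustration only.
or_hat, ci_low, ci_high = odds_ratio_wald(a=10, b=90, c=20, d=80)
includes_one = ci_low <= 1.0 <= ci_high  # True means "no difference" cannot be excluded
```

An interval that contains 1 is what the Results describe as a 95% CI that "included OR=1".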


2.5. Software 


We conducted our analyses in R (https://www.r-project.org/), using the libraries ‘tidyverse’*” and 
‘data.table’*® for data preprocessing, ‘comorbidity’” for calculating Charlson’s comorbidity index 
(CCI), and ‘tidycensus’™ for retrieving data through the US Census Bureau’s data APIs. The DAG and the
adjustment sets were built with dagitty (https://www.dagitty.net/). For the FACTS analysis, we
used Python, based on the code available at: https://github.com/weishenpan15/FACTS.


3. Results 


3.1. Characteristics of the Study Population 


We identified 1,433 individuals admitted to the hospital and diagnosed with an invasive MRSA 
infection between 2011 and 2019, based on the bio/tissue-sample source and the culture test (i.e., 
blood, fluid, bone, and other internal body site). After matching with socio-demographic 
information based on the three-digit ZIP code of each patient, and after filtering based
on all inclusion criteria as described in the methods, a total of 395 patients constituted the final study 
sample (Fig. 1). 

Table 1 describes the baseline characteristics of the sample. Patients were 55.2 years old on
average. The percentage of females was 50.1%, and 39.0% of the population was African American.
In terms of county-level SDoH, the average number of healthcare facilities was 189, 50.9% of
individuals lived in urban areas, the median household income averaged $48,300, and the
prevalence of being uninsured was 21.8%. The mean Charlson’s comorbidity index was 6.40. The prevalence
of patients who were administered vancomycin before this invasive MRSA infection was 68.6%,
and for 35.9% of patients, there was a record of multiple drug resistance (MDR). Eighty percent of
patients had a bloodstream infection, and 52.2% of patients were transferred to the ICU. The
percentages of vancomycin usage at each time point were 97.0%, 79.0%, and 66.1%, respectively. While
being treated, 10.1% of patients developed nephrotoxicity, and 53.4% had a prior history of kidney
impairment.


Table 1. Variable Characteristics of the Study Population (N=395)

Variable                                                    Mean (SD), Median [Min, Max] or N (%)

Individual-level EHR variables
  Age                                                       55.2 (16.4), 56 [19, 96]
  Age - 65+ years old                                       121 (30.6%)
  Gender - Female                                           198 (50.1%)
  Race - African American                                   154 (39.0%)
  Charlson’s comorbidity index (CCI)                        6.40 (3.92), 6 [0, 20]
  History of antibiotic usage (Vancomycin)                  271 (68.6%)
  Prior history of kidney impairment                        211 (53.4%)
  Types of Infection - Bloodstream                          316 (80.0%)
  Severity of Infection - ICU stay                          206 (52.2%)
  Prior Drug Resistance                                     142 (35.9%)
  Nephrotoxicity developed                                  40 (10.1%)
  Vancomycin use at Time 1                                  383 (97.0%)
  Vancomycin use at Time 2                                  312 (79.0%)
  Vancomycin use at Time 3                                  261 (66.1%)
County-level sociodemographic variables
  Number of Healthcare Facilities (number of beds/100,000)  189 (30.3), 192 [107, 367]
  Area - Urban                                              201 (50.9%)
  Median household Income                                   $48,300 (7,900), $44,700 [$39,500, $67,400]
  Insurance coverage (% of Uninsured)                       21.8 (4.17), 21.9 [13.8, 35.7]


Fig. 3. Annual trends in invasive MRSA cases, EHR patients, and local population

In Fig. 3, we described the annual trend of the number of patients with invasive MRSA
infection in our study sample. Additionally, we depicted the number of bacterial infection patients 
registered in our EHR system and the population of two Florida counties where large hospitals of 
the UF Health Network are situated (i.e., Alachua county and Duval county). The combined 
population of both counties exhibited an increasing trend over the years, while the number of 
invasive MRSA cases fluctuated annually. For the invasive MRSA cases, the data only covers half 
of 2019, from January to July. 


3.2. Quantification of Health Outcome Disparity 


We estimated the total and direct effects of age, race, income, and gender on the 30-day mortality 
outcome. For each of the exposure variables, we report the odds ratios (OR) and 95% confidence 
intervals (CI) obtained by fitting a logistic regression model with the adjustment set variables 
identified through the pDAG (Table 2). Income showed the strongest association with outcome
disparity (total effect OR 0.44, 95% CI 0.17-0.99), followed by age, gender, and race. All effect
estimates, except the total effect of income, had 95% CIs that included OR=1.

We then ran the FACTS algorithm on the same set of exposures (Table 3). Of note, FACTS
requires all variables to be binary, so we dichotomized the numeric variables at their median. Overall, all
paths showed low absolute weights, close to zero, for both accuracy and disparity metrics. Results
did not change when including only clinical variables or clinical and SDoH variables in the pDAG 
and associated paths. There were no relevant paths detected for income. Comorbidity was detected 
as a disparity path for age, race, and gender. Antibiotic use, timing, and renal toxicity were 
relevant with respect to race, while income, rurality and number of healthcare facilities were 
relevant for gender disparity. 
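
The median split used to binarize numeric variables for FACTS can be sketched as follows. This is a minimal illustration and not the authors' preprocessing code: the function name is hypothetical, and the handling of values exactly equal to the median (coded 0 here) is an assumption, since the text does not specify it.

```python
from statistics import median
from typing import List, Sequence


def median_split(values: Sequence[float], high_is_one: bool = True) -> List[int]:
    """Dichotomize a numeric variable at its sample median.

    With high_is_one=True, values strictly above the median are coded 1.
    With high_is_one=False, values strictly below the median are coded 1
    (the direction used for income in Table 2, "Below median=1").
    Values equal to the median are coded 0 in both cases (an assumption).
    """
    m = median(values)
    if high_is_one:
        return [int(v > m) for v in values]
    return [int(v < m) for v in values]


# Hypothetical county-level incomes, for illustration only.
incomes = [39500, 44700, 48300, 67400]
low_income_flag = median_split(incomes, high_is_one=False)
```

Dichotomizing discards within-group variation, which may contribute to the near-zero path weights reported above.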


Table 2. Total and direct effects of age, race, income, and gender on the risk of 30-day
mortality in invasive MRSA infection.

Sensitive variable, model, and odds ratio (95% CI)

Age (65+ years old=1 vs. younger=0)
  Total:  Outcome ~ Age
          2.11 [...]
  Direct: Outcome ~ Age + Chronic Comorbidity + Previous Vancomycin Use +
          Prior Drug Resistance + Severity of Infection + Types of Infection
          1.64 (0.74, 3.60)

Race (African American=1 vs. White=0)
  Total:  Outcome ~ Race
          0.77 (0.34, 1.65)
  Direct: Outcome ~ Race + Types of Infection + Severity of Infection + Previous
          Vancomycin Use + Prior Drug Resistance + Chronic Comorbidity
          0.58 (0.24, 1.30)

Income (Below median=1 vs. Upper=0)
  Total:  Outcome ~ Income
          0.44 (0.17, 0.99)
  Direct: Outcome ~ Income + Chronic Comorbidity + Race + Rurality + Gender
          0.74 (0.24, 2.14)

Gender (Female=1 vs. Male=0)
  Total:  Outcome ~ Gender
          1.1[...] (0.54, 2.45)
  Direct: Outcome ~ Gender + Chronic Comorbidity + Previous Vancomycin Use +
          Prior Drug Resistance + Severity of Infection + Types of Infection
          1.24 (0.57, 2.72)

Table 3. FACTS decomposition of disparity in 30-day mortality from invasive
MRSA infection, with respect to age, race, income, and gender

Sensitive
variable   Variable set      Disparity   Accuracy   Disparity path

Age        Clinical-only     -0.01162     0.00840   Comorbidity
Age        Clinical + SDoH   -            -         -
Race       Clinical-only     -0.05714     0.03361   Comorbidity
Race       Clinical-only      0.02571    -0.01512   Prior Kidney Impairment
Race       Clinical-only      0.01428    -0.00840   Nephrotoxicity developed after Time 1 <->
                                                    Vancomycin use at Time 2 <->
                                                    Vancomycin use at Time 3
Race       Clinical + SDoH   -0.05428     0.02857   Comorbidity
Race       Clinical + SDoH   -0.01142    -0.00672   Prior Vancomycin Use
Race       Clinical + SDoH    0.00857    -0.00504   Prior Kidney Impairment
Income     Clinical-only     -            -         -
Income     Clinical + SDoH   -            -         -
Gender     Clinical-only     -0.05084     0.02521   Comorbidity
Gender     Clinical + SDoH   -0.05101     0.02857   Insurance
Gender     Clinical + SDoH    0.01333    -0.00672   Income - Rurality
Gender     Clinical + SDoH    0.00265     0.02857   Income - Rurality -> Facility


4. Discussion 


We deconstructed sociodemographic disparity in 30-day mortality among invasive MRSA 
infections, using EHR data and fairness AI methods. Upon explicit expert-derived causal 
assumptions, we found moderate to strong effects of age, gender, race, and income on mortality,
although the 95% CIs included no difference in risk among groups. Our fairness analysis, conducted 
using the FACTS algorithm, revealed that comorbidity status was the most significant contributor 
to outcome disparity across age, race, and gender, while no distinct paths could be found for income. 
For race, antibiotic usage, timing, and prior kidney impairment contributed to disparity, while SDoH 
contributed to outcome disparity among genders. Age and income are well-known risk factors for 
mortality, and confirming their effects was clearly expected. Prior kidney impairment, identified 
through pre-infection creatinine levels, could contribute to the observed differences in invasive 
MRSA mortality rates between racial groups. Kidney impairment significantly influences the 
clinical management of MRSA infections in hospitals. Beyond its effect on the immune response, 
renal impairment also complicates the choice and dosage of anti-MRSA antibiotics that can be safely
administered. For instance, vancomycin, the most commonly used antibiotic agent for treating these 
infections and a known iatrogenic cause of acute kidney injury, necessitates close monitoring and 
dosage adjustments based on renal function“? . Further, while creatinine levels served as an indicator 
of renal function in our analysis, clinical teams during the study period were likely assessing for 
renal impairment using creatinine-based equations that vary by race, e.g., estimated glomerular 
filtration rate (eGFR), and changing accordingly the medical management of the patient. As a result, 
use of eGFR in clinical practice could have confounded the disparity paths that we decomposed*!. 
Compared to other studies, our findings are consistent with recent literature that analyzed 
individuals diagnosed with S. aureus bacteremia, reporting no differences in mortality between 
racial groups*>*. However, it has to be noted that study populations are heterogeneous and 
demographic groups exhibit strong differences in risk factors. We found that variations in mortality 
rates are partially attributable to the burden of underlying comorbidities, therapeutic choices, and 
SDoH that differ among ages, incomes, genders, and races. Of note, our results align with another 
recent analysis of EHR in Florida that quantified the effect of demographics and SDoH on outcome 
disparity in urinary tract infections (UTIs), where comorbidity, number of healthcare facilities, 
income and insurance were also found to be involved in disparity paths with respect to race.¹¹

The decision to use FACTS for this study was driven by the algorithm’s emphasis on causal 
pathways which account for both directed and undirected arrows between variables in partially 
directed acyclic graphs (pDAGs). While statistical-based algorithms focus on assessing whether all 
groups have the same metric for outcomes, causal-based fairness is more concerned with analyzing 
the presence of causal effects of a sensitive attribute on outcomes, including path-specific fairness.>* 
Although studies exist that focus on path-specific effects **>’, the FACTS algorithm introduced a 
novel approach. The algorithm concentrates not only on causal paths but also on uncovering 
overlooked sources of disparity that may contribute to model disparity. The capability of the FACTS 
algorithm to consider undirected relationships among risk factors in pDAGs is pivotal, especially 
when relationships are unclear. Therefore, public health researchers could benefit directly from
using advanced algorithms such as FACTS to quantify the otherwise unknown contribution of each factor to
model disparity.

Our study has several limitations. Firstly, our causal assumptions may be incorrect, and we did 
not account for unmeasured confounders in our models. Despite researchers’ best efforts to define
a DAG, it remains a challenge in real-life situations to accurately represent all variable relationships. 
This brings up concerns of incorrect assumptions and the potential for reverse causality. However, 
both FACTS and the generalized adjustment criterion can work with partial DAGs, which can 
mitigate some of these issues. Another recommended approach is to estimate effects using multiple 
DAGs, each incorporating different assumptions. Secondly, due to our strict inclusion criteria, only 
one third of patients who potentially had an invasive MRSA infection were included, which likely 
introduced selection bias (i.e., exclusion bias). Exclusion bias arises when particular members of a 
population are excluded from a study due to criteria set by researchers. The patients who were not 
included in our study constituted about two-thirds of the patients with invasive MRSA; these 
patients were mainly excluded due to a lack of sequential antibiotic prescription records. The 
excluded patients might have exhibited different disparity pathways if sufficient information had 
been available to conduct such disparity analyses. However, by starting from what is available to
us, the pathways discovered by FACTS could provide initial inferences about the invasive MRSA
population. These inferences could be further refined as principles for secondary data collection
become more standardized in research, thereby minimizing missing information. Therefore,
despite the constraint of not encompassing every patient in our EHR, our study can still offer
valuable insights. We aimed to identify diverse causal pathways of disparity by
carefully delineating our cohort definition in this preliminary analysis. This approach was
intended to curtail information bias and mitigate the impact of missing data. By sharing our method 
transparently, we seek to contribute meaningful insights, informed by a clear and comprehensive 
understanding of the available data, that can elucidate the disparity pathways prevalent in the 
broader invasive MRSA population. Thirdly, given the sample size and observed effect sizes, type 
II errors were also likely. Fourthly, the current release of the FACTS algorithm is capable of 
handling only binary variables; it is anticipated that future versions of the algorithm will expand its 
capabilities. 


5. Conclusion 


In conclusion, this work demonstrates the practical utility of fairness AI methods in public health 
settings. The FACTS framework can be useful to explore intervention strategies for optimizing 
health outcomes among different sociodemographic groups using actionable variables in the causal 
pathways, e.g., reducing rates of comorbidities in vulnerable populations, and equalizing SDoH. For 
future studies, it is paramount to relax the population selection constraints, and to explore multiple 
different causal assumptions to reduce residual bias. 
Imputation of race and ethnicity categories using genetic ancestry from real-world genomic
testing data 


Brooke Rhead, Paige E. Haffener, Yannick Pouliot, and Francisco M. De La Vega


Tempus Labs, Inc. 
Chicago, IL, 60654, USA 


The incompleteness of race and ethnicity information in real-world data (RWD) hampers its utility 
in promoting healthcare equity. This study introduces two methods—one heuristic and the other 
machine learning-based—to impute race and ethnicity from genetic ancestry using tumor profiling 
data. Analyzing de-identified data from over 100,000 cancer patients sequenced with the Tempus xT 
panel, we demonstrate that both methods outperform existing geolocation and surname-based 
methods, with the machine learning approach achieving high recall (range: 0.859-0.993) and 
precision (range: 0.932-0.981) across four mutually exclusive race and ethnicity categories. This 
work presents a novel pathway to enhance RWD utility in studying racial disparities in healthcare. 


Keywords: Race; ethnicity; ancestry; imputation; disparities; equity; real-world data.


1. Introduction 


Real-world data (RWD) offers insights into disease etiology, therapy outcomes, and racial 
disparities in healthcare.!* However, its utility in improving healthcare equity is limited by the 
significant sparsity of race and ethnicity data. This gap, attributable to factors such as lack of capture, 
data loss during transfer and de-identification,** and shortcomings in electronic health record 
integrations,’ leads to reliance on limited, potentially biased datasets that may result in poorly 
generalizable results and biased disease outcome predictors.* 

Several remediation strategies have been proposed, including improving data collection, 
conducting complete case analysis, modeling missingness in analyses, supplementing with 
additional data, and employing imputation methodologies.* Existing imputation methods, many of 
which leverage census data based on geolocation and correlations between people's surnames and 
their self-reported race and ethnicity,°’ achieve moderate accuracy and require access to protected 
health information (PHI), limiting their applicability.*” 

Molecular tumor profiling, an assay used in support of therapy decisions in cancer patients, is 
often accompanied by a wealth of multimodal RWD that, once de-identified, can be harnessed for 
research.'° This can include clinical metadata, imaging, and molecular data, such as DNA variants 
on a set of cancer-related genes and transcript sequences from different patient tissues.¹¹

Inferring genetic ancestry, or more accurately, genetic similarity to reference populations,” from 
molecular testing sequencing data, offers a potential solution to the challenge of missingness in race 
and ethnicity data. The granularity of such inferences is contingent on the availability of allele 
frequency data across samples from reference populations, with the most common level of genetic 
ancestry inference being at super-population level categories, as described by the 1000 Genomes
Project.'? Although genetic ancestry is not equivalent to race or ethnicity, a strong correlation 
between these two concepts has been observed among US populations.'*!> We propose to leverage 
this correlation and the genetic information available in molecular testing RWD using two methods 
— one heuristic and the other based on machine learning — to impute mutually exclusive race and 
ethnicity categories from genetic ancestry. Here, we benchmark these methods and find they 
outperform previously reported race and ethnicity imputation methods, with a machine learning- 
based method providing the most accurate imputation. 


2. Methods 


The categorizations of race and ethnicity in this study adhere to the standards developed by the US 
Office of Management and Budget,!® which are also used in the US census. These standards are 
based on two self-reported questions: a) Race (American Indian or Alaska Native, Asian, Black or 
African American, Native Hawaiian or Other Pacific Islander, and White); and b) Ethnicity 
(Hispanic or Latino and Not Hispanic or Latino). However, these categories present analytical 
challenges due to the orthogonal race and ethnicity questions, and it is often more practical to 
consolidate answers to these two questions into non-overlapping classes,'’ defined in this study as: 
Hispanic or Latino, non-Hispanic (NH) Asian, NH Black, and NH White, with the other races having 
insufficient numbers at the moment to develop reliable models in our source data. This consolidation 
allows for a more streamlined and comprehensive analysis of race and ethnicity in the context of 
RWD. 


2.1. Data 


Genomic and clinical data from patients with a variety of cancer diagnoses were obtained from the Tempus
database. The selected cohort consisted of 132,523 de-identified records of patients whose tissues 
were sequenced with the Tempus xT next-generation sequencing (NGS) panel (596-648 genes, v2- 
v4, tumor-normal matched when tissue available)!''* from 2018 to 2022. These records had been 
previously de-identified for other studies and passed minimal data quality filters. A total of 33,232 
records had populated race, ethnicity, and geolocation data and belonged to one of the four non- 
overlapping race and ethnicity categories that we imputed: 4,357 Hispanic or Latino, 1,258 NH 
Asian, 3,120 NH Black, and 24,497 NH White. Race and ethnicity information in the Tempus 
database is obtained from a combination of electronic health record integrations and data abstraction 
from clinical documents and can be self-declared by patients or observed by practitioners. 
Information could be missing because there was no attempt to collect it, because patients or 
practitioners abstained from answering, or because it was not captured in the Tempus database. 
Analyses were performed using de-identified data under human subject research exemption granted 
by Advarra, Inc. Institutional Review Board, protocol Pro00042950. 


2.2. Determination of genetic ancestry 


We estimated genetic ancestry proportions using a re-implementation of the ADMIXTURE 
supervised global genetic ancestry estimation algorithm.'? This approach calculated the proportions 
of ancestries for five super-populations—Africa (AFR), the Americas (AMR), East Asia (EAS),
Europe (EUR), and South Asia (SAS)—using a previously published bespoke set of 654 ancestry 
informative markers (AIMs).”° Briefly, AIMs were selected from single-nucleotide variants present 
in the reference samples that intersect with the targeted regions of the Tempus xT NGS assay, are 
not protein-changing, and are present at significantly different frequencies across the reference 
populations.*! We sourced reference allele frequency data for these AIMs from the 1000 Genomes
Project,'? the Human Genome Diversity Project,” and the Simons Genome Diversity Project
databases.” In the case of the AMR super-population, we excluded the 1000 Genomes Project's
admixed "AMR" population and only included allele frequencies for Native American individuals 
available in the other sources. To evaluate the accuracy of our methods, we compared our global 
ancestry proportion estimates on whole-genome sequencing data from the Pan-Cancer Analysis of 
Whole Genomes Project,” with published global ancestry proportions determined by summing 
genome-wide local ancestry segments derived using the RFMix method.” This comparison yielded 
an average mean squared error, normalized to the sum of population proportions present in the 
dataset, of 0.12. The Tempus xT assay utilizes matched normal tissue when available (present for 
51% of the study cohort) to classify variants as either germline or somatic, but germline variants 
can still be inferred in the absence of normal tissue.'! The genetic ancestry proportion estimation 
method utilizes variant calls from normal tissue or those deemed to be germline. To assess 
performance when no matched normal tissue is available, we estimated proportions from both the 
tumor sample and the matched normal sample for a subset of patients (N = 3,358) and found that the 
five estimated proportions were highly concordant, with Pearson’s correlation coefficient ranging 
from 0.9977 to 0.9999,7° 


2.3. Benchmarking and performance metrics for race and ethnicity category imputation 


We relied on our cohort’s stated race and ethnicity data as available in the Tempus database as our 
ground truth. To assess the performance of imputation methods, we employed a range of accuracy 
measures specific to each predicted race or ethnicity category. Recall, also called sensitivity or true 
positive rate,” measures the proportion of individuals correctly assigned to a category among all 
individuals truly in that category. Precision, or positive predictive value,”® is the fraction of relevant 
instances among the retrieved instances, i.e., the proportion of correctly assigned individuals among 
all those assigned to a category. The F1-score is the harmonic mean of precision and recall,
providing a balance between these two metrics. We also evaluated several measures of overall 
accuracy. Cohen's kappa?” is a measure of agreement between predicted and true categories, 
accounting for the possibility of agreement occurring by chance. The correct rate, or accuracy, 
measures the proportion of all predictions that are correct.” Log loss quantifies the difference 
between predicted probabilities of belonging to a class and the true value (0 or 1) of belonging to 
that class, with lower log loss indicating better model performance*. The area under the receiver 
operating characteristic curve, or AUC, is a measure of model performance based on sensitivity and 
specificity across all classification thresholds and thus is not sensitive to any specific chosen 
threshold. prAUC is an analogous measure based on precision and recall. A predicted probability 
threshold of >0.5 was used for all metrics that rely on a single classification for each subject. 
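
The per-category and overall measures defined above can be written out in a few lines of plain Python. This is a sketch for illustration only, not the evaluation code used in this study: the function names are hypothetical, the log loss is shown in its per-class (one-vs-rest) binary form, and the toy labels are invented.

```python
import math
from collections import Counter
from typing import Sequence, Tuple


def per_class_metrics(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> Tuple[float, float, float]:
    """Precision, recall, and F1-score for one category (one-vs-rest)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def cohens_kappa(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Agreement beyond chance: (p_o - p_e) / (1 - p_e). Undefined when p_e == 1."""
    n = len(y_true)
    p_observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_expected = sum(true_counts[k] * pred_counts[k] for k in true_counts) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)


def binary_log_loss(y_true: Sequence[int], p_pred: Sequence[float], eps: float = 1e-15) -> float:
    """Mean negative log-likelihood of 0/1 membership given predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


# Toy example with two invented categories "A" and "B".
truth = ["A", "A", "B", "B"]
predicted = ["A", "B", "B", "B"]
precision_a, recall_a, f1_a = per_class_metrics(truth, predicted, "A")
kappa = cohens_kappa(truth, predicted)
```

In the toy example, category "A" has perfect precision but a recall of 0.5, which is the kind of asymmetry the per-category metrics are meant to expose.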
In addition to these common measures, we also utilized metrics proposed by Elliot et al. The
weighted error compares the true prevalence of race/ethnicity in the validation dataset to the 
predicted prevalence, providing an indication of the overall error rate. The weighted 
correlation measures the weighted average correlation (calculated using vectors of indicators) 
between true race and ethnicity and imputed category for each of the four categories, with weights 
equal to true prevalence. Together, these metrics offer a comprehensive evaluation of the 
performance of our imputation methods. 


2.4. Heuristic imputation of race and ethnicity 


We initially imputed mutually exclusive race and ethnicity categories from genetic ancestry 
proportions using a set of heuristics (Table 1) in part derived from admixture proportions reported 
in the literature for Black and Hispanic or Latino groups in the United States.!° We defined four 
categories: Hispanic or Latino, NH Asian, NH Black, and NH White. Patients who did not fit the 
categories defined by these heuristics were labeled “complex.” This latter category could be 
considered a no-call, as patients classified as such are typically excluded from any downstream 
analyses, and for comparison with other methods described below. 


Table 1. Race and ethnicity imputation heuristics from genetic ancestry. Super-population codes: AFR, 
Africa; AMR, Americas; EAS, East Asia; EUR, Europe; SAS, South Asia. 


Imputed category      Super-population genetic ancestry thresholds 
Hispanic or Latino    >10% AMR and >70% combined AMR, EUR, and AFR 
NH Asian              >70% combined EAS and SAS 
NH Black              >20% AFR, <10% AMR, and >70% combined AFR and EUR 
NH White              >80% EUR and <10% AMR 
Complex               Remaining patients not meeting above thresholds 
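
The thresholds in Table 1 translate directly into a small rule-based classifier. The sketch below is illustrative only: the function name and the dictionary-based input format (proportions between 0 and 1) are assumptions, not the authors' implementation.

```python
def impute_heuristic(ancestry):
    """Assign a category from super-population ancestry proportions (0 to 1).

    `ancestry` is a dict with keys AFR, AMR, EAS, EUR, and SAS.
    The rules mirror the thresholds listed in Table 1.
    """
    afr = ancestry["AFR"]
    amr = ancestry["AMR"]
    eas = ancestry["EAS"]
    eur = ancestry["EUR"]
    sas = ancestry["SAS"]

    if amr > 0.10 and (amr + eur + afr) > 0.70:
        return "Hispanic or Latino"
    if (eas + sas) > 0.70:
        return "NH Asian"
    if afr > 0.20 and amr < 0.10 and (afr + eur) > 0.70:
        return "NH Black"
    if eur > 0.80 and amr < 0.10:
        return "NH White"
    # No rule matched: treated as a no-call ("complex").
    return "Complex"
```

When the five proportions sum to 1, the four rules cannot fire simultaneously (for example, NH Black requires more than 20% AFR while NH White requires more than 80% EUR), so the order of the checks does not affect the result.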


2.5. Machine learning imputation of race and ethnicity 


We also developed machine learning (ML)-based imputation methods, wherein an ML algorithm is 
trained to classify subjects into race and ethnicity categories based on genetic ancestry and other 
inputs. For all models, a single train+test and validation set was assembled from the 33,232 patient 
records with stated race and ethnicity that fit our imputation categories and with available home 
address 3-digit ZIP code. Features used by these models included genetic ancestry proportions for 
AFR, AMR, EAS, EUR, and SAS; US census division of patient’s home state (nine geographic 
groupings of states defined by the US Census Bureau: Pacific, Mountain, West North Central, West 
South Central, East North Central, East South Central, South Atlantic, Middle Atlantic, and New 
England); and “demographic proportions,” i.e., proportions of Hispanic or Latino, NH Asian, NH 
Black, and NH White residing in each patient’s three-digit ZIP code tabulation area (ZCTA), as 
available from the 2021 5-year American Community Survey and mapped to three-digit ZIP codes 
using UDS Mapper.” We split the train+test and validation sets 90/10 while maintaining the US 
census division proportions in each set to ensure that the sets were aligned well for populations 
whose genetic ancestry proportions vary by U.S. geography, e.g., Hispanic or Latino.! We 
evaluated models using three groups of features: 1) ML-ancestry: genetic ancestry proportions only; 
2) ML-ancestry+geolocation: genetic ancestry proportions and US census divisions; 3) ML- 
ancestry+demographics: genetic ancestry proportions and demographic proportions. 
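
A 90/10 split that preserves US census division proportions can be sketched as a stratified split. The code below is a minimal illustration under stated assumptions: records are plain dictionaries, the field name used for the division is hypothetical, and the random seed and rounding rule are arbitrary choices rather than details taken from the study.

```python
import random
from collections import defaultdict


def stratified_split(records, key, holdout_fraction=0.10, seed=0):
    """Split records into (train_test, validation), stratifying on `key`.

    Within each stratum (e.g., a US census division), roughly
    `holdout_fraction` of the records go to the validation set, so the
    strata proportions are approximately equal in both sets.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for rec in records:
        by_stratum[rec[key]].append(rec)

    train_test, validation = [], []
    for stratum in sorted(by_stratum):
        members = list(by_stratum[stratum])
        rng.shuffle(members)
        n_holdout = round(len(members) * holdout_fraction)
        validation.extend(members[:n_holdout])
        train_test.extend(members[n_holdout:])
    return train_test, validation
```

For example, calling `stratified_split(records, "division")` on a list of patient records holds out about 10% of each division for validation and leaves the rest for cross-validated training.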

We implemented all machine learning models in R using the caret package (v 6.0.94).°° A 
number of models based on supervised training algorithms were evaluated, including models based 
on the random forest (method=“rf”) and gradient boosting (method=“gbm”) algorithms. We 
ultimately chose a boosted logistic regression algorithm (method=“LogitBoost”,”* presented here) 
as it provided the ability to make no-call assignments and applied a probabilistic threshold in 
classification. Boosted logistic regression is a supervised machine learning algorithm that utilizes 
negative log-likelihood as a cost function. It iteratively builds decision trees to classify subjects, 
where each iteration is trained on a sample (with replacement) of the data in which subjects who 
were incorrectly classified in the previous round are more frequently sampled. The final classifier 
consists of a weighted combination of decision trees, where trees with lower log loss have more 
weight, and it returns the probabilities of belonging to each category for each subject. We chose to 
assign “No Call” to any subject with all probabilities <0.5. All models were trained using 10-fold 
cross validation. Grid expansion was performed to evaluate boosting iterations from 1 to 100 in 
intervals of 10. The optimal number of iterations and the final model were selected based on the 
lowest log loss value. 
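
The no-call rule and the log loss criterion described above can be expressed compactly. The following sketch assumes that a model returns, for each subject, a dictionary mapping category names to predicted probabilities; the function names are hypothetical, and this is not the caret implementation used in the study.

```python
import math


def assign_category(probabilities, threshold=0.5):
    """Return the most probable category, or "No Call" if all are below threshold."""
    best_category = max(probabilities, key=probabilities.get)
    if probabilities[best_category] < threshold:
        return "No Call"
    return best_category


def log_loss(true_labels, probability_rows, eps=1e-15):
    """Mean negative log-likelihood of the true category over all subjects."""
    total = 0.0
    for label, probabilities in zip(true_labels, probability_rows):
        # Clip to avoid log(0) when the true category gets zero probability.
        p = min(max(probabilities.get(label, 0.0), eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(true_labels)
```

With a threshold of 0.5, at most one category can exceed the threshold when the probabilities sum to 1, so the assignment is unambiguous whenever a call is made.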


3. Results 


3.1. Comparison of performance of race and ethnicity imputation methods 


Table 2 summarizes the overall performance of the heuristic assignment method and each of the ML 
models. The ML model that utilized combined genetic ancestry proportions and demographic 
proportions (the proportions of the population in a patient’s three-digit ZCTA belonging to Hispanic 
or Latino, NH Asian, NH Black, and NH White) achieved the best mean F1-score (0.957), Cohen’s 
kappa (0.936), correct rate (0.974), log loss (0.122), AUC (0.982), and prAUC (0.946), whereas the 
heuristic method performed the worst by most metrics: mean F1-score 0.939, Cohen’s kappa 0.903, 
correct rate 0.959, and weighted correlation 0.876 (its weighted error of 0.009 was intermediate). 
The ML model that solely considered genetic ancestry proportions achieved the best weighted 
correlation (0.930) and weighted error (0.007), whereas the ML model that included geolocation in 
the form of the US census division of a patient’s home state had intermediate performance by most metrics. 


Table 2. Overall performance of race and ethnicity imputation methods for the validation set (N=3,319). 
Metrics that rely on a single classification threshold used a predicted probability of >0.5 for computation. 
Refer to section 2.5 for ML method descriptions. Best performing metric indicated with bold. 


Imputation Method           Mean F1-Score  Cohen’s Kappa  Correct Rate  Weighted Correlation  Weighted Error  Log Loss  AUC    prAUC 
Heuristic                   0.939          0.903          0.959         0.876                 0.009           -         -      - 
ML-ancestry                 0.954          0.934          0.973         0.930                 0.007           0.127     0.980  0.930 
ML-ancestry+geolocation     0.955          0.935          0.973         0.926                 0.009           0.131     0.979  0.898 
ML-ancestry+demographics    0.957          0.936          0.974         0.928                 0.013           0.122     0.982  0.946 


When evaluating performance by category, we found that recall, precision, and F1-score were 
all at or above 0.932 for the NH Asian, NH Black and NH White categories (Table 3). Performance 
of all imputation methods was worst for the Hispanic or Latino category, with recall ranging from 
0.859-0.887, precision from 0.833-0.964, and F1-score from 0.859-0.909. 


Table 3. Performance of race and ethnicity imputation methods on validation set (N=3,319) per classification 
category. Refer to section 2.5 for ML method descriptions. Best performing metric for each category 
indicated with bold. 


                                        Classification Category, N 
Metric     Imputation Method            Hispanic or Latino, 435  NH Asian, 130  NH Black, 301  NH White, 2,463 
Recall     Heuristic                    0.887                    0.983          0.983          0.966 
           ML-ancestry                  0.876                    0.962          0.993          0.987 
           ML-ancestry+geolocation      0.877                    0.969          0.983          0.988 
           ML-ancestry+demographics     0.859                    0.976          0.993          0.990 
Precision  Heuristic                    0.833                    0.935          0.942          0.985 
           ML-ancestry                  0.938                    0.933          0.967          0.981 
           ML-ancestry+geolocation      0.941                    0.932          0.969          0.981 
           ML-ancestry+demographics     0.964                    0.932          0.968          0.978 
F1-Score   Heuristic                    0.859                    0.959          0.962          0.976 
           ML-ancestry                  0.906                    0.947          0.980          0.984 
           ML-ancestry+geolocation      0.908                    0.950          0.976          0.984 
           ML-ancestry+demographics     0.909                    0.954          0.980          0.984 


3.2. Performance of heuristic method 


Perhaps unsurprisingly, the heuristic method for assigning race and ethnicity categories based on 
genetic ancestry proportions alone underperformed by all measures as compared to the ML models 
(cf. Table 3). For the Hispanic or Latino category (the most difficult to predict using the selected 
features), the heuristic method did have the highest recall (0.887), but this was achieved at the cost 
of low precision (0.833), also reflected in this method obtaining the lowest F1-score (0.859) for that 
category. The heuristic method did achieve the highest recall, precision, and F1-score for the NH 
Asian category. Overall, although the heuristic method did not perform as well as the ML methods, 
its performance was not far behind, achieving an overall correct classification rate of ~96% 
compared to ~97% for the ML models. The no-call rate (i.e., patients assigned to the “complex” 
category) was 2.5%. 


3.3. Performance of ML-ancestry boosted logistic regression model 


We found that the boosted logistic regression model that utilized only genetic ancestry proportions 
improved upon the heuristic method for all overall performance metrics, with an overall correct 
classification rate of 97.3%. It had lower recall (0.876) but higher precision (0.938) for the Hispanic 
or Latino category than the heuristic method. The model had a recall of 0.962-0.993 for the three 
non-Hispanic categories, indicating that it correctly identifies the vast majority of patients in those 
categories and is usually correct in its predictions, with precision ranging from 0.933-0.981. The 
no-call rate was very low at 0.7%. 


3.4. Performance of ML models including geolocation and demographics 


Adding geolocation or demographic composition obtained from patients’ home address ZCTAs 
to the genetic ancestry proportions (ML-ancestry+geolocation and ML-ancestry+demographics) 
slightly improved model performance according to most metrics, yielding a correct classification 
rate of 97.3% and 97.4%, respectively. The ML-ancestry+demographics model had the best overall 
performance by all metrics except the less commonly used weighted metrics, which emphasize 
performance according to the true prevalence of each race and ethnicity category in the validation 
dataset. Individual category performance metrics followed a similar pattern to that of the 
ML-ancestry model. Notably, the ML-ancestry+demographics model had the best precision for the 
Hispanic or Latino category (0.964), which may be desirable for use cases where correct predictions 
of this category are valued over high recall. The no-call rate was 1.1% for ML-ancestry+geolocation 
and 1.0% for ML-ancestry+demographics. 


3.5. Reclassification of stated race and ethnicity categories by imputation 


Because of its minimal input needs, we selected the ML-ancestry model for further characterization 
and applied it to the entire labeled dataset, regardless of whether geolocation data were available 
(N=35,229). The resulting confusion matrix (Table 4) compares the imputed categories with the 
stated race and ethnicity from the Tempus database, including the rate of no-calls and the number 
and fraction of misclassified records for each stated category. The confusion matrix for the 
validation dataset mirrors this table in terms of percentages (data not shown). 


Table 4. Confusion matrix comparing imputed race and ethnicity category to stated category for the ML- 
ancestry model on all labeled data, including records without geolocation information (N=35,229). 
Percentage of each stated category and numbers of patients (in parentheses) are indicated in each cell. Total 
percentage and number of misclassified patients for each stated category is given in the last row. 


                                        Stated category 
Imputed category      Hispanic or Latino  NH Asian       NH Black       NH White 
Hispanic or Latino    82.3% (4,059)       0.2% (2)       0.5% (18)      0.8% (195) 
NH Asian              0.8% (39)           96.5% (1,285)  0.2% (6)       0.2% (57) 
NH Black              1.8% (91)           0.1% (1)       97.7% (3,231)  0.2% (49) 
NH White              11.4% (560)         2.5% (33)      0.6% (19)      98.5% (25,266) 
No Call               3.7% (180)          0.8% (10)      1.0% (32)      0.4% (96) 
Misclassified         14.0% (690)         2.7% (36)      1.3% (43)      1.2% (301) 
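
The per-category and overall rates can be recomputed from the counts in Table 4. The sketch below hard-codes those counts by stated category; the data structure and function names are illustrative assumptions. Here "misclassified" means imputed to a different category, with no-calls counted separately, which matches the last row of the table.

```python
IMPUTED_ROWS = ["Hispanic or Latino", "NH Asian", "NH Black", "NH White", "No Call"]

# Counts from Table 4, keyed by stated category; values follow IMPUTED_ROWS order.
TABLE4_COUNTS = {
    "Hispanic or Latino": [4059, 39, 91, 560, 180],
    "NH Asian": [2, 1285, 1, 33, 10],
    "NH Black": [18, 6, 3231, 19, 32],
    "NH White": [195, 57, 49, 25266, 96],
}


def summarize(table):
    """Return per-stated-category totals, no-calls, and misclassifications."""
    summary = {}
    for stated, counts in table.items():
        by_imputed = dict(zip(IMPUTED_ROWS, counts))
        total = sum(counts)
        concordant = by_imputed[stated]
        no_call = by_imputed["No Call"]
        misclassified = total - concordant - no_call
        summary[stated] = {
            "total": total,
            "concordant": concordant,
            "no_call": no_call,
            "misclassified": misclassified,
            "misclassified_rate": misclassified / total,
            "no_call_rate": no_call / total,
        }
    return summary


def overall_rates(summary):
    """Return (overall misclassification rate, overall no-call rate)."""
    total = sum(s["total"] for s in summary.values())
    misclassified = sum(s["misclassified"] for s in summary.values())
    no_call = sum(s["no_call"] for s in summary.values())
    return misclassified / total, no_call / total
```

Calling `overall_rates(summarize(TABLE4_COUNTS))` gives the pooled rates across all four stated categories, which can be checked against the figures quoted in the text.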


Additionally, Figure 1 provides a visual representation of the allocation of patients from their 
stated race and ethnicity to their imputed categories through a flow diagram. 

The confusion matrix further indicates that the Hispanic or Latino category experienced the 
highest rates of no-calls (3.7%) and misclassifications (14.0%), whereas the NH White category had 
the lowest (0.4% and 1.2%, respectively). The flow diagram in Figure 1 illustrates that most patients 
were assigned to their stated category, with the majority of misclassifications occurring between 
Hispanic or Latino and NH White categories. Nevertheless, the overall misclassification rate of this 
model was low at 3.0% (1,070 of 35,229 patients in Table 4), with a further 0.9% (318) receiving no call. 


Fig. 1. Flow diagram showing the relationship between stated (left) and imputed (right) race and ethnicity 
categories with the ML-ancestry model in all labeled data, including records without geolocation information and 
excluding no-calls (N=34,911). 


3.6. Distribution of race and ethnicity categories imputed on unlabeled patients 


We also imputed race and ethnicity categories using the ML-ancestry model for all patients in the 
cohort (N=132,523) and examined the distribution of availability of race and ethnicity labels across 
categories (Figure 2). A total of 35,229 patients belonged to one of the four imputation categories 
according to their stated race and ethnicity data (“labeled”). There were 62,674 patients with no 
available race or ethnicity data at all (“unlabeled”), and an additional 34,620 with only partial 
information, i.e., either stated race or stated ethnicity (or both) were available, but patients did not 
fall into one of the four imputation categories, most frequently because ethnicity was unavailable 
(“partially labeled”). Imputed categories had comparable levels of unlabeled data, with the No Call 
and NH Asian categories having the most (53% and 52%, respectively) and NH Black having the 
least (44%). The Hispanic or Latino category had the highest level of labeled data by far (40%) due 
to the definition of that category only requiring a stated ethnicity of “Hispanic or Latino” and 
allowing stated race to be any value, including a missing value. The remaining categories had 22- 
26% labeled data. We observed that about half of each of the NH Asian, NH Black, and NH White 
imputed categories had records with a concordant stated race but a missing ethnicity (data not 
shown). 


[Figure 2 (image not reproduced). y-axis: percentage of imputed category (0% to 100%); x-axis: imputed 
category (Hispanic or Latino, NH Asian, NH Black, NH White, No Call); legend: Unlabeled, Partially 
labeled, Labeled.] 


Fig. 2. Counts of patients in the full dataset (N=132,523) by label availability status and race and ethnicity 
category as imputed using the ML-ancestry model. Labeled = stated race and ethnicity are available, and a 
patient falls into one of: Hispanic or Latino, NH Asian, NH Black, or NH White based on this information. 
Unlabeled = neither stated race nor ethnicity is available. Partially labeled = either stated race or ethnicity 
is available, but the patient cannot be placed in one of the four listed categories. 


3.7. Analysis of potential biases 


The dataset used to develop our ML models is heavily imbalanced, with the largest group of patients 
(~74%) having a stated category of NH White, and the smallest group (~4%) having a stated 
category of NH Asian, potentially leading to overfitting to the majority category and biasing model 
performance. To address these potential problems, we evaluated additional models beyond those 
discussed here, wherein each model was trained in the same way except that each train+test set was 
downsampled to require an equal number of patients in each category, matching that of the category 
with the smallest number of patients. However, the downsampled models exhibited worse overall 
performance by all of our metrics and, within each category, had lower F1-scores (data not shown). 
Additionally, the performance metrics computed on the train+test sets during cross-validation were 
only slightly better than those computed on the validation set, alleviating concerns of overfitting. 
Importantly, the performance metrics we considered included metrics broken down by classification 
category to enable evaluation of whether any particular category was underperforming relative to 
the others. We also considered metrics that are suited to imbalanced data, such as Cohen’s kappa. 
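
The downsampling experiment described above can be sketched as follows. This is a minimal illustration assuming records are dictionaries with a label field; the function name, field name, and seed are hypothetical, and sampling is done without replacement.

```python
import random
from collections import defaultdict


def downsample_to_smallest(records, label_key, seed=0):
    """Randomly downsample every category to the size of the smallest one."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec[label_key]].append(rec)

    # Target size is the count of the least frequent category.
    target = min(len(members) for members in by_label.values())

    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], target))
    return balanced
```

The trade-off is that balancing discards most records from the majority category, which is one plausible reason a downsampled model can perform worse overall, as reported above.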


4. Discussion 


Although a direct comparison of our methods with other imputation methods was not possible due 
to the absence of PHI (such as surnames or addresses) in our de-identified dataset, we compared our 
performance to that reported in the literature. Our models consistently and substantially 
outperformed these prior methods.* 3! E.g., the weighted correlation of our ML-ancestry model 
was 24-33 percentage points better, while its weighted error was an order of magnitude lower than 
other methods (Table 5). 


Table 5. Comparison of performance metrics of ML-ancestry and other imputation methods based on 
metrics reported in the literature. Best performing metric indicated with bold. BISG = Bayesian Improved 
Surname and Geocoding;° CTBF = CT-based full;? CTBR = CT-based reduced.? 


Imputation Method  Cohen’s Kappa  Correct Rate  Weighted Correlation  Weighted Error  Reference 
ML-ancestry        0.934          0.973         0.930                 0.007           This study 
BISG               0.58           0.78          0.597                 0.089           Xue et al, 2019a 
CTBF               0.67           0.81          0.668                 0.048           Xue et al, 2019a 
CTBR               0.65           0.81          0.595                 0.051           Xue et al, 2019a 
Random Forest      0.67           0.807         0.672                 0.025           Xue et al, 2019b 


In our study, the category with the lowest recall was Hispanic or Latino, ranging from 86-89%. 
This category also had the highest level of no-calls (3.4% vs. <1%). Prior methods report an even 
more pronounced drop in performance for this category. °! However, the ML-ancestry+demographics 
model provided the best precision (96%) at a good recall rate (86%), while 
the heuristic method provided the best recall (89%) but at a significantly lower level of precision 
(83%). Although the intended use of the imputation may dictate the best trade-off, we believe that 
precision is the most important feature as minimization of misclassified subjects is generally more 
desirable. The drop in performance in the Hispanic or Latino category may be due to the fact that 
self-affiliation with this category corresponds more with culture and language than with genetic 
similarity,'* with levels of admixture within this group varying widely depending on country of 
origin and among the coasts of the US.!5 

As with all RWD analyses, our work has potential limitations. Differences between patients with 
complete vs. incomplete stated race and ethnicity could affect model training and therefore 
imputation performance. The unequal distribution of imputed categories in labeled and unlabeled 
data suggests that there are indeed some slight differences in the composition of patients who lack 
race and ethnicity data, with imputed NH Asian category most likely to be missing this information, 
but therefore also most able to benefit from imputation. Given the limited numbers of American 
Indian or Alaska Native and Native Hawaiian or Other Pacific Islander individuals in our dataset as 
well as the insufficient public allele frequency information from these groups, we are unable to 
develop models to impute those categories, meaning those individuals will be misclassified, 
typically as Hispanic or Latino and NH Asian, respectively. As the Tempus database grows and 
additional AIM allele frequencies become available, our model could be retrained to enable 
classification using these additional categories. The performance of our models on populations 
outside the US, or on differently ascertained population samples, is unknown; however, our results 
suggest that retraining with additional data pertaining to those populations could yield similar 
performance in other settings. 

When developing our race imputation methods, we adhered to established recommendations for 
ethical imputation.** We audited input data for bias, scrutinized methodological choices for potential 
bias introduction, and rigorously assessed the accuracy of the imputed data; our aim is to use these 
data to study or reduce disparities. Our adherence to these guidelines underscores our commitment 
to the responsible use of race imputation in promoting equity in healthcare. 


5. Conclusions 


Addressing racial disparities is pivotal to advancing equity in precision medicine. However, the 
frequent unavailability of data disaggregated by race and ethnicity in RWD can lead to biased 
outcome predictors,** inadequate representation in clinical trials,’ and poorly targeted policies, 
potentially exacerbating disparities.*4 While the ultimate goal is to have complete self-reported data 
for optimal race and ethnicity information, our study highlights the efficacy of using genetic ancestry 
data to impute these categories in a de-identified setting, mitigating the challenge of data sparsity 
for these data in RWD from US populations. Our approach could allow more accurate identification 
of racial disparities in certain healthcare settings where genetic data are available, contributing to 
the development of fair artificial intelligence predictors and more targeted and equitable healthcare 
interventions. 


Precision medicine, also often referred to as personalized medicine, targets the development of 
treatments and preventative measures specific to the individual’s genomic signatures, lifestyle, and 
environmental conditions. The series of Precision Medicine sessions in PSB has continuously 
highlighted the advances in this field. Our 2024 collection of manuscripts showcases algorithmic 
advances that integrate data from distinct modalities and introduce innovative approaches to extract 
new, medically relevant information from existing data. These evolving technology and analytical 
methods promise to bring closer the goals of precision medicine to improve health and increase 
lifespan. 


1. Introduction 


Precision medicine involves tailoring medical decisions and treatments to individual patients in 
a data-driven manner. The accumulation of medically-relevant and, particularly, molecular data has 
uncovered the potential for mechanistic insight into disease processes facilitating clinical decision 
making. Advances in genomic techniques, e.g. spatial transcriptomics and single cell analysis, have 
further enabled identification of the genetic biomarkers of patient drug responses, susceptibility to 
diseases, and other medically-relevant outcomes. At the same time, the enormous scale of this data 
has stimulated use of novel computational methods, resulting in, e.g., the recent explosion in deep 
learning-based biological and medical data analysis techniques. 

While the concept of personalized medicine stretches back nearly two decades (just slightly 
longer than our PSB session), the implementation of precision medicine in practice (still) remains 
in its early stages. Novel technologies require new integrative approaches to advance the state of the 
art in this field. In our 2024 session we feature work from researchers across diverse domains, who 
integrate various omics data to provide valuable insights into disease mechanisms, diagnosis, and 
treatment. In this collection, we explore their cutting-edge advancements in more detail. 


2. Session Contributions 


2.1. Transcriptome and Histopathology Integration 


A number of studies submitted to our session focused on integrating spatial transcriptomics and 
histopathology data and demonstrating the potential of this combination to enhance our understanding of 
tumor biology. Song et al enriched their transcriptome-driven findings by incorporating morphological 
features extracted from histopathology images to enable a comprehensive analysis of tumor architecture via 
feature clustering. Azher et al employed contrastive learning and Graph Convolutional Neural Networks 
(GCN) to predict disease stage, lymph node metastasis, and survival prognosis in cancer patients. 
Meanwhile, Srinivasan et al developed a transformer-based model to shed light on the molecular pathways 
involved in skin aging due to light exposure. Their findings not only contribute to our understanding of 
their chosen conditions but also demonstrate the potential of their approaches for studying other diseases. 


2.2. Spatial Proteomics: Revealing Tissue Microenvironments. 


Wu et al introduced innovative methods to analyze tissue microenvironments at high resolution using 
spatial proteomics. By measuring inferred protein polarity, they identified distinct subpopulations of 
immune cells within tumors, shedding light on potential markers of better prognosis. This approach holds 
promise for identifying patients who may respond favorably to specific treatments. 


2.3. Microbiome Analysis: A Closer Look at Gut Health. 


Sapoval et al proposed a novel metagenomic analysis pipeline that bypasses the need for genome assembly, 
allowing for direct comparisons between patients and healthy controls. This reference-free approach is 
particularly valuable for studying conditions like myalgic encephalomyelitis/chronic fatigue syndrome 
(ME/CFS), where the gut microbiome plays a crucial role. Understanding dysbiosis at this level can help 
identify potential disease markers and therapeutic targets. 


2.4. Polygenic Risk Scores: Bridging Genomics and Disease. 


Cardone et al examined the role of a lymphocyte count PRS (polygenic risk score) in predicting CD4 T-cell 
recovery in individuals with HIV undergoing anti-retroviral therapy. While their findings indicated limited 
PRS impact compared to clinical factors, they underscore the importance of considering multiple variables 
in precision medicine studies. Kember et al focused on improving PRS accuracy for cardiometabolic traits; 
their findings emphasize the need for implementing separate scoring mechanisms for diverse ancestries. 


2.5. Integrative Methods for Clustering, Meta-analysis, Deconvolution, and Network Rewiring. 


Numerous contributions aimed at enhancing integrative methods for meta-analysis, subtype detection, cell 
type deconvolution, and network rewiring. Zhang et al introduced nSEA, an algorithm for unsupervised 
clustering of low-grade gliomas, uncovering a novel subtype with clinical implications. Huang et al 
proposed a multi-modal clustering approach that combines various data types to cluster tumor samples 
across different cancer types, offering a more holistic perspective on cancer classification. 


In the realm of meta-analysis, Fukutani et al overcame batch effects in gene expression data, highlighting 
the importance of robust analytical techniques in large-scale studies. Sufriyana et al employed data-driven 
ontology inference to uncover novel gene sets relevant to subtypes of preeclampsia, showcasing the power 
of meta-analysis in identifying novel biological processes. 


Deconvolution, a vital tool for deciphering cellular composition from omics data, faced challenges in 
understanding nascent RNA. Maas et al introduced an adaptation for nascent RNA sequencing, addressing 
the nuances of this emerging field. 


Finally, Dannenfelser et al explored how alternative splicing rewires protein interaction networks in cancer. 
Their Splitpea method provides insights into the complex interplay between alternative splicing and 
disease, offering a novel perspective on cancer biology. 


Spatial transcriptomics (ST) represents a pivotal advancement in biomedical research, enabling the 
transcriptional profiling of cells within their morphological context and providing a powerful tool for 
understanding spatial heterogeneity in cancer tissues. However, current analytical approaches, akin 
to single-cell analysis, largely depend on gene expression, underutilizing the rich morphological 
information inherent in the tissue. We present a novel method integrating spatial transcriptomics and 
histopathological image data to better capture biologically meaningful patterns in patient data, 
focusing on aggressive cancer types such as glioblastoma and triple-negative breast cancer. We used 
a ResNet-based deep learning model to extract key morphological features from high-resolution 
whole-slide histology images. Spot-level PCA-reduced vectors of both the ResNet-50 analysis of the 
histological image and the spatial gene expression data were used in Louvain clustering to enable 
image-aware feature discovery. Assessment of features from image-aware clustering successfully 
pinpointed key biological features identified by manual histopathology, such as for regions of fibrosis 
and necrosis, as well as improved edge definition in EGFR-rich areas. Importantly, our combinatorial 
approach revealed crucial characteristics seen in histopathology that gene-expression-only analysis 
had missed. 


Supplemental Material: 
https://github.com/davcraig75/song_psb2014/blob/main/SupplementaryData.pdf 


Keywords: Spatial transcriptomics; Deep learning; Image-aware clustering 


1. Introduction 


Mapping the spatial organization of genes and cells in tissues is the foundation for understanding 
higher-level molecular and cellular processes driving disease pathogenesis. In the past decade, 
paradigm-shifting approaches such as single-cell RNA-seq (scRNA) have provided unprecedented 
insights into cellular populations. More recently, Spatial Transcriptomics (ST) methods have 
emerged (e.g., Visium ST), providing a view of cellular RNA expression and disease pathology in 
the context of neighboring cells and structures'. Instead of measuring a single bulk transcriptome 
from a tissue section, ST obtains thousands of transcriptomes across a tissue section at spatially 
distinct spots, where each spot covers a few cells with Visium*, or provides sub-cellular data as with 
MERFISH* and Xenium. These emerging ST technologies have unlocked unprecedented 
possibilities for exploring the transcriptomic architecture of multicellular organisms, revealing 
intricate cellular heterogeneity in diverse tissues and disease states’. 

However, analytical methodologies that do not take into account spatial context or the 
underlying histopathology frequently limit the full potential of these potent technologies. In the case 
of ST analysis methods, many are extensions of earlier strategies that lack direct incorporation of 
spatial information between spots or do not directly leverage the underlying imaging data of protein 
and cellular structure. For example, current clustering techniques, such as Louvain and k-means 
clustering provided by Seurat®, primarily focus on gene expression’, often neglecting spatial context 
and the potential complementary information that can be gleaned from tissue morphology. This 
incomplete fusion of transcriptomic and morphological data limits our ability to fully understand 
the cellular ecosystem within tissues, particularly in cancer-related states. Several groups are 
making efforts to integrate imaging data in order to better capture the richness of information 
embodied in high resolution H&E images’”!°. 

In our exploration of the utility of advanced analytical methods for spatial transcriptomics, we 
developed a novel approach we termed "stMIC" (Spatial Transcriptomics and Morphological 
Integrated Clustering). Central to stMIC's design is the incorporation of a form of Convolutional 
Neural Network (CNN) deep learning, specifically the Residual Network-50 or ResNet-50, which 
is characterized by its versatility and effectiveness across a wide array of applications*!'. The 
ResNet-50 architecture uses the concept of residual connections, and, with its existing prior training, 
is highly effective for image classification, object detection, and image segmentation'*. To underline 
the clinical implications of our study, we deployed our method on disease specimens that had previously 
undergone histopathological assessment, with a particular focus on aggressive malignancies such as glioblastoma 
and triple-negative breast cancer. 


2. Method 
2.1. Visium Spatial Gene Expression Assay, Sequencing, and Preprocessing 


Freshly frozen, OCT-embedded tissues were cryosectioned and mounted on Visium spatial gene 
expression slides (10x Genomics, #1000184), which contain four 6.5 mm × 6.5 mm capture areas 
comprising 5,000 barcoded spatial features each. Hematoxylin and eosin (H&E) staining was 
applied, and microscopic images were obtained subsequently with a Zeiss Axioscan2 microscope 
using a 10x objective. After staining, tissues underwent a permeabilization process to facilitate RNA 
binding to the slide surface; the permeabilization conditions were determined using the Spatial Tissue 
Optimization procedure (10x Genomics, #1000193). The on-slide cDNA synthesized from immobilized RNA was used to 
generate sequencing libraries, which were paired-end sequenced on an Illumina NovaSeq 6000 
instrument to produce a minimum of ~250 million read pairs per sample!?. The Triple-Negative 
Breast Cancer (TNBC) sample utilized in our study was originally characterized using Spatial 
Transcriptomics (ST) in the work of Bassiouni et al’. 

The Space Ranger pipelines (version 1.1.0; 10x Genomics) were employed to preprocess the
sequencing data. Demultiplexing of BCL data and conversion to FASTQ format were accomplished
using the spaceranger mkfastq pipeline. The spaceranger count pipeline then performed read
alignment to the human reference genome GRCh38, UMI counting, and generation of feature-spot
matrices corresponding to the microscopic tissue image. This pipeline also provided automatic
tissue detection and fiducial alignment based on the image. Raw gene expression data underwent
Counts Per Million (CPM) normalization and subsequent log transformation, followed by scaling
to zero mean and unit variance. All of these steps were completed using Scanpy (version 1.6.0)¹⁴.
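
The normalization steps described above can be illustrated with a minimal NumPy sketch. This is not the authors' code: they used Scanpy, where the corresponding calls are presumably sc.pp.normalize_total (with target_sum=1e6), sc.pp.log1p, and sc.pp.scale. The function name cpm_log_scale, the toy count matrix, the use of the natural logarithm of (1 + x), and the use of the population standard deviation are all assumptions made for illustration.

```python
import numpy as np


def cpm_log_scale(counts):
    """Normalize a (spots x genes) raw count matrix: CPM, log1p, then z-score per gene."""
    counts = np.asarray(counts, dtype=float)

    # Counts per million: divide each spot by its total count, then multiply by 1e6.
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # guard against division by zero for empty spots
    cpm = counts / totals * 1e6

    # Log transform (natural log of 1 + x); the log base is an assumption.
    logged = np.log1p(cpm)

    # Scale each gene (column) to zero mean and unit variance across spots.
    mean = logged.mean(axis=0)
    std = logged.std(axis=0)
    std[std == 0] = 1.0  # constant genes stay at zero after centering
    return (logged - mean) / std


# Hypothetical toy data: 4 spots x 3 genes.
toy_counts = np.array([[10, 0, 5],
                       [3, 7, 0],
                       [0, 2, 8],
                       [4, 4, 4]])
scaled = cpm_log_scale(toy_counts)
```

After this transformation every gene column has mean 0 and standard deviation 1, so highly expressed genes do not dominate the subsequent PCA.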


2.2. Histopathological Image Annotation and Evaluation 


Frozen tissues from each block selected for study were stained with H&E. Images derived from 
the Visium slides were examined and annotated by a pathologist utilizing Adobe Photoshop 
software'>. Regions characterized by blood vessel, necrosis, dense immune cell infiltrates, or 
stromal fibrosis were indicated when applicable. 


2.3. Histological Image Segmentation and Patch Selection 


The first stage of image preprocessing segmented the whole-slide H&E histological images from
patient samples into smaller patches. One patch was cut for each spot under tissue, i.e., each
spot whose value in the second column of the “tissue_positions_list.csv” file is 1, and each
patch entirely encompassed its spot. For each spot S_i, the center of the patch was aligned with
the center of the spot, given by the coordinates (x_i, y_i) in the last two columns of the
“tissue_positions_list.csv” file. The height and width of the patch both equaled the diameter of
the spot, d, taken from scalefactors_json.json. The boundaries of the patch were therefore
determined by the following coordinates: the upper boundary at x_i + d/2, the lower boundary at
x_i - d/2, the left boundary at y_i - d/2, and the right boundary at y_i + d/2.
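
As a hedged illustration of the patch definition above, the sketch below computes pixel ranges for each in-tissue spot using only the Python standard library. The barcodes and coordinates are made up, the helper names patch_bounds and patches_for_tissue_spots are hypothetical, and clipping to the image size and rounding outward to whole pixels are assumptions that the text does not specify.

```python
import csv
import io
import math

# Hypothetical excerpt in the "tissue_positions_list.csv" layout (no header):
# barcode, in_tissue, array_row, array_col, x (pixel), y (pixel)
SAMPLE_CSV = """AAAC-1,1,10,20,100,200
AAAG-1,0,10,22,100,260
AAAT-1,1,11,21,10,5
"""


def patch_bounds(x, y, d, n_rows, n_cols):
    """Return (x_lo, x_hi, y_lo, y_hi) for a d-by-d patch centred on (x, y).

    The ranges span x - d/2 .. x + d/2 and y - d/2 .. y + d/2, rounded outward
    to whole pixels and clipped to the image (both are assumptions).
    """
    half = d / 2.0
    x_lo = max(0, math.floor(x - half))
    x_hi = min(n_rows, math.ceil(x + half))
    y_lo = max(0, math.floor(y - half))
    y_hi = min(n_cols, math.ceil(y + half))
    return x_lo, x_hi, y_lo, y_hi


def patches_for_tissue_spots(csv_text, d, n_rows, n_cols):
    """Map barcode -> patch bounds for every spot flagged as under tissue."""
    patches = {}
    for barcode, in_tissue, _arr_row, _arr_col, x, y in csv.reader(io.StringIO(csv_text)):
        if in_tissue != "1":  # keep only spots under tissue
            continue
        patches[barcode] = patch_bounds(float(x), float(y), d, n_rows, n_cols)
    return patches


patches = patches_for_tissue_spots(SAMPLE_CSV, d=50, n_rows=1000, n_cols=1000)
```

With an image loaded as an array indexed by x then y, a patch could then be cropped as image[x_lo:x_hi, y_lo:y_hi].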


2.4. Feature Extraction Using ResNet-50 


We used a convolutional neural network (CNN) for feature extraction from each patch: a ResNet-50
model pretrained on the ImageNet dataset, as provided by TensorFlow (version 2.6.0). The model
was used with its top fully-connected layer excluded and with "avg" pooling selected; all other
settings were left at their defaults. The segmented histology image patches, read with OpenCV
(version 4.5.3) and resized to (224, 224, 3), served as inputs. For each patch, the ResNet-50
model output a 2048-dimensional feature vector representing the patch's morphological features.
Features were standardized using StandardScaler from sklearn (version 0.22.1), which removes the
mean and scales to unit variance.
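
A minimal sketch of this feature extraction step, assuming the Keras ResNet50 application bundled with TensorFlow and OpenCV for image loading. The function names are hypothetical and this is not the stMIC implementation. The text does not state whether the patches were additionally passed through a ResNet-specific preprocessing function or converted from OpenCV's BGR channel order, so neither is done here. TensorFlow and OpenCV are imported lazily so the helpers can be defined without those packages installed.

```python
import numpy as np


def build_extractor():
    """ImageNet-pretrained ResNet-50 without its top layer, using global average pooling."""
    from tensorflow.keras.applications import ResNet50  # lazy import: TensorFlow is heavy
    return ResNet50(include_top=False, weights="imagenet", pooling="avg")


def load_patch(path, size=(224, 224)):
    """Read one patch image from disk and resize it to the network input size."""
    import cv2  # lazy import: OpenCV is only needed when reading image files
    img = cv2.imread(path)  # uint8 array of shape (H, W, 3), BGR channel order
    if img is None:
        raise FileNotFoundError(path)
    return cv2.resize(img, size)


def extract_features(model, patches):
    """Stack patches into one batch and return an (n_patches, 2048) feature matrix."""
    batch = np.stack(patches).astype("float32")
    return np.asarray(model.predict(batch))
```

In use, one would call build_extractor() once, load every patch with load_patch, and pass the list to extract_features; the resulting matrix could then be standardized with sklearn's StandardScaler().fit_transform(features), as described in the text.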


2.5. Integration of Matrices and Clustering 


The normalized gene expression matrix and the morphological feature matrix each underwent
Principal Component Analysis (PCA) separately, and the top 10 principal components of each were
retained. The two reduced matrices were concatenated by matching the barcodes of the spots.
Clustering used the Louvain modularity optimization algorithm with a resolution in the range
1.5-1.9 and k-neighbors set to 39, as implemented in Orange Data Mining (version 3.30.1). This
configuration yielded a stable set of clusters, comparable in number to those obtained with the
current analytical approach based solely on gene expression.
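
The integration step can be sketched as follows, assuming PCA computed by singular value decomposition of the mean-centered matrix. The function names, the random toy matrices, and the hypothetical barcodes are illustrative only; the Louvain clustering that follows (run in Orange Data Mining by the authors) is not reproduced here.

```python
import numpy as np


def top_pcs(matrix, n_components=10):
    """Project a (spots x features) matrix onto its top principal components."""
    X = np.asarray(matrix, dtype=float)
    X = X - X.mean(axis=0)  # centre each feature
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_components] * S[:n_components]  # PC scores per spot


def integrate(expr, expr_barcodes, morph, morph_barcodes, n_components=10):
    """Reduce both matrices separately, then concatenate rows that share a barcode."""
    expr_pcs = top_pcs(expr, n_components)
    morph_pcs = top_pcs(morph, n_components)
    morph_row = {b: i for i, b in enumerate(morph_barcodes)}
    # Keep the expression-matrix order for barcodes present in both matrices.
    shared = [b for b in expr_barcodes if b in morph_row]
    expr_idx = [i for i, b in enumerate(expr_barcodes) if b in morph_row]
    morph_idx = [morph_row[b] for b in shared]
    combined = np.hstack([expr_pcs[expr_idx], morph_pcs[morph_idx]])
    return shared, combined


# Hypothetical toy data: 30 spots, 50 genes, 64 image features.
rng = np.random.default_rng(0)
expr_barcodes = [f"spot{i}" for i in range(30)]
morph_barcodes = list(reversed(expr_barcodes))  # deliberately in a different order
expr = rng.normal(size=(30, 50))
morph = rng.normal(size=(30, 64))
shared, combined = integrate(expr, expr_barcodes, morph, morph_barcodes)
```

Each row of the combined matrix holds 10 gene expression PCs followed by 10 morphological PCs for the same spot, which is the input to graph-based clustering.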


2.6. Evaluation Measures 


The performance of our method was evaluated using multiple validation metrics. The
Adjusted Rand Index (ARI) served as the initial metric, quantifying the similarity between each
clustering assignment and the pathologist's annotation¹⁵. Gene set enrichment analysis (GSEA) was
then performed at the cluster level via the Broad Institute’s GenePattern software
(RRID:SCR_003199). Differentially expressed genes within chosen clusters were identified with
the FindAllMarkers function of Seurat (version 4.3.0.1) using the Wilcoxon rank sum test.
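
For readers unfamiliar with the ARI, the following standard-library sketch computes it from the contingency table of two labelings, using the usual pair-counting formula with chance correction. It is an illustration rather than the authors' implementation, and the example labels are invented.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Chance-corrected agreement between two labelings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the partitions trivially agree

    # Pairs of items placed together in both labelings, in the first, and in the second.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything in one cluster
    return (sum_cells - expected) / (max_index - expected)


# Invented example: pathologist regions versus cluster labels for four spots.
annotation = ["fibrosis", "fibrosis", "necrosis", "necrosis"]
same_partition = [2, 2, 7, 7]      # identical grouping under different label names
crossed_partition = [0, 1, 0, 1]   # every cluster mixes both regions
ari_same = adjusted_rand_index(annotation, same_partition)
ari_crossed = adjusted_rand_index(annotation, crossed_partition)
```

The ARI is 1 for identical partitions regardless of label names, close to 0 for random agreement, and can be negative when agreement is worse than chance.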

Gene lists from each selected cluster were subjected to preranked GSEA against chosen gene
sets (H: hallmark gene sets¹⁶, C2:CP:KEGG¹⁷, C4¹⁸, C7: ImmuneSigDB¹⁹), with the number of
permutations set to 1000 and the collapse dataset option set to “Remap_only”. Finally, spatial
expression patterns for genes of interest were visualized using the SpatialPlot function in Seurat.


2.7. Implementation 


stMIC was developed in Python 3.7 as a user-friendly pipeline. Setup instructions and tutorials
are available on the stMIC GitHub page: https://github.com/USCDTG/stMIC.


2.8. Supplemental Material 


Supplemental Material referred to in the paper may be found at the following URL: 
https://github.com/daveraig75/song_psb2014/blob/main/SupplementaryData.pdf 


3. Results 
3.1. Pipeline 


Our primary goal was to enhance identification of biological features from 10x Visium ST by
incorporating deep learning analysis. Figure 1A shows the default approach, in which “spot-level”
normalized gene expression obtained via the Space Ranger pipeline is first reduced by
principal component analysis (PCA) from (spots x genes) to (spots x M), where M is 10. This is
frequently followed by graph-based clustering, in which a sparse nearest-neighbor graph is built
and Louvain modularity optimization identifies highly connected modules in the graph. Other
clustering methods, such as K-means, are also used; here we present the default clustering
method of Space Ranger.

Our stMIC pipeline is shown in Figure 1B. The ST histological image is first partitioned into
one patch per tissue spot, and features are then extracted with the ResNet-50 model from the
image patch underlying each spot. Thus, if there were 4,096 passing spots, we would have the
same number of image patches. Computationally, this step took approximately 2.5-5 minutes per
whole-slide image on a machine with an NVIDIA Tesla T4 GPU with 16 GB of VRAM, depending
on the number of spots under tissue. The gene expression data were processed in parallel with
the Space Ranger pipelines, turning raw sequencing data into normalized feature-spot matrices
within 3 hours on a standard bioinformatics workstation with 32 CPU cores.

Following dimensionality reduction of both morphological and gene expression data via
Principal Component Analysis (PCA), we concatenated the reduced matrices and performed
clustering using the Louvain algorithm, an operation that took roughly 1-2 minutes on the same
workstation. To evaluate our method's performance, we used multiple validation metrics, including
the Adjusted Rand Index (ARI) and Gene Set Enrichment Analysis (GSEA).


[Figure 1 panel labels. A. Default Spatial Transcriptomics Clustering (Space Ranger): Image
Registration, UMI Counting; Filtered + Normalized Expression Data; Dimensionality Reduction;
Clustering; t-SNE; Louvain Clustering. B. Enhancing Spatial Transcriptomics Analysis by
Integrating Image-Aware Deep Learning: Image Partitioning; Analysis by Deep Learning (ResNet-50);
Dimensionality Reduction; Morphological features, Top 10 PCs; Filtered + Normalized Expression
Data; Gene Expression, Top 10 PCs; Combined Matrix; Louvain Clustering.]

Fig. 1. (A) Default spatial transcriptomics clustering approach, e.g., as used in the 10x Space
Ranger pipeline. (B) stMIC: ST analysis integrating image-aware deep learning analysis.
High-resolution histology images are first cropped into smaller patches according to the
location and size of spots within the tissue. These patches are then passed to a deep learning
model, ResNet-50, producing a morphological feature matrix. Principal Component Analysis (PCA)
is applied to this matrix, and the top 10 principal components (PCs) are selected. A parallel
procedure is applied to the gene expression matrix derived from the spatial transcriptomics
dataset. The reduced matrices from both the morphological and gene expression data are then
concatenated at each spot to form a unified matrix.


3.2. Application to human glioblastoma spatial transcriptomics data


To explore the proficiency of the pretrained ResNet-50 model, we started with a
representative glioblastoma sample, FFD1. This sample, obtained from the 10x Genomics Visium
platform, encompasses 3,594 spots and 33,538 genes. We first contrasted the clustering outcomes
of Louvain clustering based only on gene expression (GEBC) (Fig. 2B) with those of the approach
based on H&E histology image features extracted via ResNet-50 (Fig. 2C). For visual
interpretability, each cluster is assigned a unique color.

Further exploration of the three most significantly differentially expressed genes (DEGs) within
cluster 10, HBB, HBA1, and HBA2, revealed their role in blood biochemistry. HBB encodes
the beta-globin protein, while HBA1 and HBA2 code for the alpha-globin protein; together these
form essential components of hemoglobin²⁰. The expression of these genes was concentrated
predominantly within the region defined as cluster 10 (Fig. 2D), supporting the blood vessel
identity of this cluster.


[Figure 2 panel titles: B. Gene Expression Based Clustering; D. Expression of HBA1+HBA2+HBB (Globin)]

Fig. 2. Comparative analysis of clustering methods and spatial expression patterns of FFD1.
(A) Hematoxylin and eosin (H&E) staining image of the glioblastoma sample. (B) Spatial
domains identified by gene-expression-based Louvain clustering method. (C) Spatial domains 
identified by ResNet-50 feature extraction method. (D) Spatial distribution of the top 
upregulated genes in cluster 10 (HBA1, HBB, HBA2). (E) Spatial domains identified by stMIC 
method. 


These findings show that ResNet-50 can uncover areas that remain undetected by clustering based
strictly on gene expression. This advantage motivated us to develop an integrated approach that
harnesses both histological and gene expression information. The integrative method enhances
our ability to discern tissue regions and offers a more nuanced, multidimensional view of the
complex histopathological landscape. Combining image-derived and transcriptomic data allows us
to move beyond the limitations of each individual data type and to explore biological phenomena
at the tissue level more holistically. Importantly, the stMIC strategy not only pinpoints the
blood vessel area but also sharpens edge definition in this key area (Fig. 2E), increasing
precision in detecting this region. These results highlight the potential of the integrated
image-aware methodology for comprehensive and accurate histopathological profiling.

To compare with the clustering results of SpaGCN, SpaCell, and stLearn, we set the number of
clusters to ten for all methods. SpaCell, which employs an autoencoder for dimension reduction,
failed to detect the EGFR-rich region (not detailed in the main results) and did not align
closely with the pathologist-annotated blood vessel region, as depicted in Supplementary Fig. 1D.
Notably, both SpaGCN and stLearn in their default implementations were also unable to identify
the blood vessel region, as shown in Supplementary Fig. 1E&F. Still, any interpretation of
features missed or seen should be made with caution, since these types of features were not part
of the development of these tools.

Fig. 3. Comparative analysis of clustering methods and spatial expression patterns of Slide 120D.
(A) Histology and manually annotated structure for slide 120D. (B) Spatial domains identified by
gene expression-based clustering method. (C) Spatial domains identified by ResNet-50 feature
extraction method. (D) Spatial domains identified by stMIC method. (E) Adjusted Rand Index
(ARI) of the stMIC, gene expression-based (GEBC), and ResNet clustering methods, measured
against the ground truth labels (pathologist annotation). (F) Spatial expression of the
fibrosis marker gene. (G) Boxplot of GSEA for cluster 4 of sample 120D against selected gene sets.
NES = Normalized enrichment score.


3.3. Application to human triple-negative breast cancer spatial transcriptomics data


In the triple-negative breast cancer sample termed Slide 120D, ResNet-based clustering better
distinguished fibrotic from necrotic regions, a result corroborated by pathologist annotations
(Fig. 3A,C). The gene expression-based method could not differentiate these regions and split
the fibrotic region into two clusters (Fig. 3B). As with the glioblastoma sample, we then
applied our integrated image-aware method, stMIC. stMIC not only improved clustering of the
fibrotic and necrotic regions but also more accurately identified the dense lymphoid infiltrate
region (Fig. 3D).

To assess and compare the clustering methods, the Adjusted Rand Index (ARI) was employed as an
evaluation metric. The ARI is a widely used measure of the chance-corrected agreement between
two clustering assignments; here, each method's clusters were compared with the ground truth,
the pathologist's annotation. Clustering based on the integration of morphological features with
gene expression was more consistent with the pathologist's regional annotations (ARI = 0.2536)
than gene expression-based clustering (ARI = 0.1796) or ResNet-50 feature extraction clustering
(ARI = 0.1793) (Fig. 3E).

To gauge how well other methods detect intricate tissue structures, we also examined stLearn,
SpaGCN, and SpaCell. stLearn's clustering closely mirrored that of gene expression-based Louvain
clustering. Neither stLearn nor SpaGCN segregated the fibrosis and necrosis regions
(Supplementary Fig. 2E&F). SpaCell identified almost the entire fibrosis region but did not fully
differentiate between these two areas (Supplementary Fig. 2D). The ARI was 0.173 for SpaGCN,
0.151 for stLearn, and 0.125 for SpaCell (Supplementary Fig. 2H). These results should be taken
with some caution, since these tools were not designed for or evaluated on these types of
pathologies.

From a biological point of view, the improved clustering is supported by the fact that
tenascin-XB (TNXB), a key marker of fibrosis, shows high differential expression in cluster 4
(Fig. 3D&F). Tenascin-XB, a key component of the extracellular matrix, has been linked to tissue
remodeling and fibrosis, and is often upregulated in fibrotic tissues²¹. Its elevated expression
and localization within the fibrotic region not only strengthen the characterization of this
region but also imply an ongoing process of tissue remodeling, a common event in fibrosis²².
This finding again emphasizes the capacity of our integrated method to reveal crucial biological
elements and events, contributing to a more comprehensive understanding of tumor progression.

Furthermore, the stMIC approach better captured biologically relevant features within the
detected fibrotic region than the gene expression-based clustering method. Gene set enrichment
analysis revealed clear distinctions between the two methods (Fig. 3G). Only results with a
false discovery rate (FDR) < 0.05 are displayed. While no significant pathways were detected
using the expression-based method, our integrated approach identified four significantly
enriched pathways. These pathways were indicative of active immunity, antigen processing and
presentation, and humoral immune and inflammatory responses, all with Normalized Enrichment
Scores (NES) greater than 1.5 (Supplementary Fig. 2G). These findings point to substantial
immune involvement within the fibrotic area, further underscoring the added value of stMIC in
capturing these dynamics.

In another triple-negative breast cancer sample, 094D, we compared the results of the gene
expression-based, ResNet-50, and stMIC methods against the pathologist's annotation, and the
stMIC approach revealed additional biological structure (Fig. 4A-D). Notably, stMIC more
accurately captured the middle right fibrotic region, a conclusion corroborated by its higher
Adjusted Rand Index (ARI = 0.1479) compared with gene expression alone (0.1111) or ResNet
feature clustering (0.1051). In a comparison between the clustering results of stMIC and
SpaCell, SpaCell did not differentiate between the two sub-clusters described below (Fig. 4G;
Supplementary Fig. 3D). While both SpaGCN and stLearn could differentiate these sub-clusters
(Supplementary Fig. 3E&F), their performance was not markedly superior to that of stMIC. When
measured against pathologist annotations using the ARI, SpaGCN, stLearn, and SpaCell recorded
ARIs of 0.098, 0.127, and 0.091, respectively (Supplementary Fig. 3H).


In-depth analysis of marker genes, facilitated by the stMIC approach, revealed notable features
absent from gene-expression-only clustering. Specifically, what was identified as cluster 1 in
gene-expression-only clustering comprised two distinct, biologically relevant clusters. One
exhibited increased expression of the hypoxia markers CA9 and NDRG1, while the other was
characterized by expression of IFIT1, a marker indicative of an active interferon response
(Fig. 4F-H).

Fig. 4. Comparative analysis of clustering methods and spatial expression patterns of Slide 94D. (A)
Histology and manually annotated structure for slide 94D. (B) Spatial domains identified by
gene-expression-based Louvain clustering method. (C) Spatial domains identified by ResNet-50
feature extraction method. (D) Spatial domains identified by stMIC method. (E) Adjusted Rand
Index (ARI) of the stMIC, gene-expression (graph-based), and ResNet-50 clustering methods,
measured against the ground truth labels (pathologist annotation). (F-G) Spatial expression of
hypoxia marker genes. (H) Spatial expression of the interferon response marker gene. (I) Boxplot
of GSEA for cluster 4 of sample 94D against selected gene sets.

The genes CA9 (Carbonic Anhydrase IX) and NDRG1 (N-Myc Downstream Regulated 1) (Fig.
4F&G) are known to be upregulated under hypoxic conditions, which often occur in solid tumors
such as breast cancer due to inadequate oxygen supply²³,²⁴. This upregulation is a response to
low oxygen tension, an attempt to adapt to the harsh microenvironment. Hypoxia within tumors
is associated with increased invasiveness, resistance to therapy, and poor prognosis, indicating
a potentially more aggressive disease state within this cluster²⁵.

In addition to these hypoxia markers, stMIC discerns a separate region with high expression of
IFIT1 (Fig. 4H). This interferon-induced protein has been associated with active interferon
signaling and an ongoing immune response. Active interferon signaling can have complex
implications in cancer, with both tumor-suppressive and tumor-promoting properties²⁶. An
ongoing immune response may reflect the immune system's attempts to counteract tumor
progression, or could suggest that the immune system is shaping the tumor's behavior²⁷. In
addition, IFIT1 has been associated with chemotherapy response in breast cancer²⁸,
suggesting that this region might be more susceptible to chemotherapy.

Previous studies have validated the relevance of these two distinct clusters, but only through
a comprehensive joint expression analysis spanning 28 diverse samples. Thus, while expression
analysis does identify these sub-clusters, it relies on the comparative context gleaned from
multiple external samples. Furthermore, our stMIC approach identified significantly more
enriched pathways in cluster 4 than the gene-expression-based method alone (Fig. 4I). The gene
set enrichment scores of the top five Hallmark pathways are further compared between the gene
expression-based and stMIC clustering methods (Supplementary Fig. 3E). Only results with a false
discovery rate (FDR) < 0.05 are displayed. This difference underscores the potential of the
integrated image-aware approach to provide a richer, more nuanced analysis of gene expression
within specific tissue regions.

In conclusion, our integrated image-aware approach has provided new insights into the spatial
heterogeneity of the 094D triple-negative breast cancer sample, revealing patterns of hypoxia
and immune response within the tumor environment. Incorporating morphological information into
gene expression analysis can enhance the resolution of tumor substructure identification and
pave the way for a more nuanced understanding of tumor biology, with potential implications for
treatment strategies.


4. Discussion 


Our results show, first, that integrating a naive pre-trained ResNet-50 into Spatial
Transcriptomics workflows identifies features missed by standard ST gene-expression analysis
and, second, that it improves the ability to recapitulate features identified by a pathologist.

An image-aware, integrated approach can identify previously overlooked or undervalued
features. The most prominent feature identified by the combined approach was the glioma's blood
vessels and vascularization. These features are unmistakable in the histology image.
However, only three genes (HBB, HBA1, and HBA2) within this biological feature have significant
expression in the gene expression data. While this is expected from globin-producing
erythrocytes, these three genes did not contribute significantly to the first 10 PCs used in
gene expression clustering. In practice, given that hallmarks of the disease involve
post-translational modifications, many types of features would be less evident from gene
expression alone. The ability to highlight these features underscores the power of the stMIC
approach, which draws on the strengths of deep learning and the rich data afforded by spatial
transcriptomics.

The clear identification of novel features and the improved clustering are encouraging,
particularly because several aspects of the approach used here could still be improved or
optimized. First, we utilized a naive, pre-trained ResNet-50 model. Further training could lead
to improved clustering. However, histopathological image data often have well-described biases
that limit transferability²⁹. Indeed, some groups have identified approaches for internal
training, as in the case of the tool stMIC, which shows promising early results in
disease-relevant contexts³⁰. Thus, the value seen even with this naive model is notable, as
improvement was obtained without further training.

Our approach differs significantly from RESEPT³¹, stLearn³², and SpaCell⁹, which are
well-established pipelines that use ResNet models in spatial transcriptomics analysis. RESEPT
uses a graph autoencoder to embed ST gene expression data into a three-dimensional
representation and maps this to an RGB image for visual analysis using the ResNet101 deep
learning model. Unlike RESEPT, which does not consider H&E image data, our approach integrates
both spatial gene expression and H&E image data to capture morphological and transcriptional
details together, offering the potential for richer biological insights. Like stLearn, our
pipeline utilizes a pre-trained ResNet-50 model, but we diverge by using the image data in
clustering together with gene expression, rather than only in normalization. SpaCell also uses
a ResNet-50 model for feature extraction and integrates imaging with transcriptomics; it differs
by using autoencoders to reduce features and by focusing mainly on classification/prediction.

Within this work, we leveraged pathology-annotated samples to assess performance. Some methods
are developed on non-pathological systems, such as Spatial Transcriptomics (ST) of mouse cortex
or similar non-disease settings, and their utility can be limited in capturing the full spectrum
of cellular interactions in aggressive human diseases like cancer. Our approach instead
prioritizes pathological samples from patients, given their inherent complexity and
heterogeneity. By targeting pathological landmarks, our method seeks to illuminate tumor
progression and identify key biological elements and events. While we see improved results using
pathology annotation as a benchmark, this benchmark has limitations. These samples were
predominantly fresh-frozen tissue, whereas pathologists typically prefer to annotate FFPE
tissue. With the emergence of spatial transcriptomics approaches that work with FFPE tissue,
such ideally annotated histology images will become more available; however, FFPE usually
yields lower-quality ST data due to degraded RNA.

A deeper understanding of the exact mechanisms by which ResNet-50 outperforms solely
gene-expression-based approaches would be beneficial. Unraveling the strengths and limitations
of image-aware deep learning models in the context of spatial transcriptomics can provide a
foundation for optimizing or even designing new architectures that more comprehensively capture
pathology-related features. We recognize this as a crucial next step and are planning further
investigations to elucidate the specific advantages of these models and to guide the development
of even more effective methodologies.

Some advanced deep learning architectures have emerged that could potentially enhance the
methods presented in this study. The Vision Transformer (ViT)³³, for instance, processes images
by splitting them into fixed-size patches and then applies the Transformer architecture,
offering the potential to extract the more nuanced spatial relationships crucial for spatial
transcriptomics. Similarly, the MLP-Mixer³⁴, which mixes tokens with multilayer perceptrons, and
the Swin Transformer³⁵, with its shifted-window approach, may be particularly advantageous for
capturing intricate spatial hierarchies and features from histology images. Our study used the
ResNet-50 model for its proven efficacy in image analysis. However, integrating recent
architectures like ViT, MLP-Mixer, and Swin Transformer might allow for more comprehensive
feature extraction, bridging image-based nuances with spatial transcriptomic insights.

Several potential extensions and improvements could be made to the presented approaches. 
Consideration could be given to incorporating other types of omics data to enrich the data pool 
further. Refinement of the deep learning model could enhance performance by integrating new 
layers or algorithms. A promising avenue to explore is the incorporation of spatial distances between 
spots into the analysis, offering a more nuanced understanding of cellular organization. 

A broader sample set and more detailed annotations would certainly strengthen the robustness 
of our model. The requirement for specialized technical expertise to operate and interpret results is 
another hurdle that needs to be overcome. Furthermore, validating our methodology using a more 
comprehensive range of patient-derived data is necessary to ascertain the model's clinical relevance 
and translational applicability. 

Beyond cancer, the potential application of our approach to other diseases warrants further 
exploration. With the in-depth view of the tumor microenvironment that our method provides, we 
foresee a crucial role for it in the realm of personalized medicine, particularly given advances in 
immuno-oncology treatments. As our understanding of cellular heterogeneity within tissues
becomes increasingly nuanced, an image-informed ST approach will likely serve as a powerful tool 
in understanding the role of the immune microenvironment. This work underscores the exciting 
potential of spatial transcriptomics and deep learning in shaping the future of understanding disease 
heterogeneity. 


Availability of data and materials. 


The stMIC code was developed in Python 3.7 as a user-friendly pipeline. Code, setup
instructions, and tutorials are available on the stMIC GitHub page: https://github.com/USCDTG/stMIC.

Supplemental Material referred to in the paper may be found at the following URL: 
https://github.com/daveraig75/song_psb2014/blob/main/SupplementaryData.pdf 



References 


1. Rao, A., Barkley, D., França, G. S. & Yanai, I. Exploring tissue architecture using spatial
transcriptomics. Nature 596, 211–220 (2021).

2. Asp, M., Bergenstråhle, J. & Lundeberg, J. Spatially Resolved Transcriptomes-Next
Generation Tools for Tissue Exploration. Bioessays 42, e1900221 (2020).

3. Vickovic, S. et al. High-definition spatial transcriptomics for in situ tissue profiling. Nat.
Methods 16, 987–990 (2019).

4. Chen, K. H., Boettiger, A. N., Moffitt, J. R., Wang, S. & Zhuang, X. RNA imaging.
Spatially resolved, highly multiplexed RNA profiling in single cells. Science 348, aaa6090 (2015).

5. Li, Q., Zhang, X. & Ke, R. Spatial Transcriptomics for Tumor Heterogeneity Analysis.
Front. Genet. 13, 906158 (2022).

6. Stuart, T. et al. Comprehensive Integration of Single-Cell Data. Cell 177, 1888–1902.e21
(2019).

7. Edsgärd, D., Johnsson, P. & Sandberg, R. Identification of spatial expression trends in
single-cell gene expression data. Nat. Methods 15, 339–342 (2018).

8. Hu, J. et al. SpaGCN: Integrating gene expression, spatial location and histology to
identify spatial domains and spatially variable genes by graph convolutional network. Nat.
Methods 18, 1342–1351 (2021).

9. Tan, X., Su, A., Tran, M. & Nguyen, Q. SpaCell: integrating tissue morphology and spatial
gene expression to predict disease cells. Bioinformatics 36, 2293–2294 (2020).

10. Ifvarsson, Falk, Vidman & Thorén. [Social Welfare Department’s general advice on
prevention, diagnosis and treatment of eye infections in newborn infants]. Jordemodern 99,
398–403 (1986).

11. He, K., Zhang, X., Ren, S. & Sun, J. Deep Residual Learning for Image Recognition.
arXiv:1512.03385 (2015). https://doi.org/10.48550/arXiv.1512.03385

12. Kutluer, N., Solmaz, O. A., Yamacli, V., Eristi, B. & Eristi, H. Classification of breast
tumors by using a novel approach based on deep learning methods and feature selection. Breast
Cancer Res. Treat. 200, 183–192 (2023).

13. Bassiouni, R. et al. Spatial Transcriptomic Analysis of a Diverse Patient Cohort Reveals a
Conserved Architecture in Triple-Negative Breast Cancer. Cancer Res. 83, 34–48 (2023).

14. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression
data analysis. Genome Biol. 19, 15 (2018).

15. Steinley, D. Properties of the Hubert-Arabie adjusted Rand index. Psychol. Methods 9,
386–396 (2004).

16. Liberzon, A. et al. The Molecular Signatures Database (MSigDB) hallmark gene set
collection. Cell Syst. 1, 417–425 (2015).

17. Kanehisa, M. & Goto, S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic
Acids Res. 28, 27–30 (2000).

18. Segal, E., Friedman, N., Koller, D. & Regev, A. A module map showing conditional
activity of expression modules in cancer. Nat. Genet. 36, 1090–1098 (2004).

19. Godec, J. et al. Compendium of Immune Signatures Identifies Conserved and Species-
Specific Biology in Response to Inflammation. Immunity 44, 194–206 (2016).




20. Richter, F., Meurers, B. H., Zhu, C., Medvedeva, V. P. & Chesselet, M.-F. Neurons 
express hemoglobin alpha- and beta-chains in rat and human brains. J. Comp. Neurol. 515, 
538-547 (2009). 

21. Cohen, C. et al. The roles of Tenascin C and Fibronectin 1 in adhesive capsulitis: a pilot 
gene expression study. Clinics 71, 325-331 (2016). 

22. Caja, L. et al. TGF-β and the Tissue Microenvironment: Relevance in Fibrosis and Cancer. 
Int. J. Mol. Sci. 19, (2018). 

23. Kuhlenbaumer, G., Stögbauer, F., Ringelstein, E. B. & Young, P. Hereditary Peripheral 
Neuropathies. (Springer Science & Business Media, 2005). 

24. Shamis, S. A. K., Edwards, J. & McMillan, D. C. The relationship between carbonic 
anhydrase IX (CAIX) and patient survival in breast cancer: systematic review and meta-analysis. 
Diagn. Pathol. 18, 46 (2023). 

25. Jing, X. et al. Role of hypoxia in cancer therapy by regulating the tumor 
microenvironment. Mol. Cancer 18, 157 (2019). 

26. Minn, A. J. Interferons and the Immunogenic Effects of Cancer Therapy. Trends Immunol. 
36, 725-737 (2015). 

27. Hiam-Galvez, K. J., Allen, B. M. & Spitzer, M. H. Systemic immunity in cancer. Nat. Rev. 
Cancer 21, 345-359 (2021). 

28. Weichselbaum, R. R. et al. An interferon-related gene signature for DNA damage 
resistance is a predictive marker for chemotherapy and radiation for breast cancer. Proc. Natl. 
Acad. Sci. U. S. A. 105, 18490-18495 (2008). 

29. Hagele, M. et al. Resolving challenges in deep learning-based analyses of 
histopathological images using explanation methods. Sci. Rep. 10, 6423 (2020). 

30. Zuo, C. et al. Elucidating tumor heterogeneity from spatially resolved transcriptomics data 
by multi-view graph collaborative learning. Nat. Commun. 13, 5962 (2022). 

31. Chang, Y. et al. Define and visualize pathological architectures of human tissues from 
spatially resolved transcriptomics using deep learning. Comput. Struct. Biotechnol. J. 20, 
4600-4617 (2022). 

32. Pham, D. et al. stLearn: integrating spatial location, tissue morphology and gene 
expression to find cell types, cell-cell interactions and spatial trajectories within undissociated 
tissues. bioRxiv 2020.05.31.125658 (2020) doi:10.1101/2020.05.31.125658. 

33.  Dosovitskiy, Alexey, et al. An image is worth 16x16 words: Transformers for image 
recognition at scale. arXiv preprint arXiv:2010.11929 (2020). 

34. Tolstikhin, Ilya O., et al. Mlp-mixer: An all-mlp architecture for vision. Advances in 
neural information processing systems 34 (2021): 24261-24272. 

35. Liu, Ze, et al. Swin transformer: Hierarchical vision transformer using shifted windows. 
Proceedings of the IEEE/CVF international conference on computer vision. 2021. 




Spatial Omics Driven Crossmodal Pretraining Applied to Graph-based Deep Learning 
for Cancer Pathology Analysis 


Zarif L. Azher 
Thomas Jefferson High School for Science and Technology 
Alexandria, VA 22312, USA 
Email: 2024zazher@tjhsst.edu 


Michael Fatemi 


University of Virginia, Department of Computer Science 
Charlottesville, VA 22904, USA 
Email: myfatemi04@gmail.com 


Yunrui Lu, Gokul Srinivasan, Alos B. Diallo 


EDIT, Department of Pathology and Laboratory Medicine, Dartmouth Hitchcock Medical Center 
Lebanon, NH 03756, USA 
Email: yunrui.lu@dartmouth.edu, gokulsrin@gmail.com, alos.b.diallo.gr@dartmouth.edu 


Brock C. Christensen, Lucas A. Salas 


Department of Epidemiology, Geisel School of Medicine at Dartmouth 
Lebanon, NH 03756, USA 
Email: brock.c.christensen@dartmouth.edu, lucas.a.salas@dartmouth.edu 


Fred W. Kolling IV, Laurent Perreard 


Genomics Shared Resource, Dartmouth Cancer Center 
Lebanon, NH 03756, USA 
Email: fred.w.kolling.iv@dartmouth.edu, laurent.perreard@dartmouth.edu 


Scott M. Palisoul, Louis J. Vaickus, Joshua J. Levy* 


EDIT, Department of Pathology and Laboratory Medicine, Dartmouth Hitchcock Medical Center 
Lebanon, NH 03756, USA 
Email: scott.m.palisoul@hitchcock.org, louis.j.vaickus@hitchcock.org, joshua.j.levy@dartmouth.edu 


Graph-based deep learning has shown great promise in cancer histopathology image analysis by 
contextualizing complex morphology and structure across whole slide images to make high-quality 
downstream outcome predictions (e.g., prognostication). These methods rely on informative 
representations (i.e., embeddings) of the image patches that make up larger slides, which are used as 
node attributes in slide graphs. Spatial omics data, including spatial transcriptomics, is a novel paradigm 
offering a wealth of detailed information. Pairing this data with corresponding histological imaging 
localized at 50-micron resolution may facilitate the development of algorithms which better 
appreciate the morphological and molecular underpinnings of carcinogenesis. Here, we explore the 
utility of leveraging spatial transcriptomics data with a contrastive crossmodal pretraining 
mechanism to generate deep learning models that can extract molecular and histological information 
for graph-based learning tasks. Performance on cancer staging, lymph node metastasis prediction, 
survival prediction, and tissue clustering analyses indicates that the proposed methods improve 
graph-based deep learning models for histopathological slides compared to 
leveraging histological information from existing schemes, demonstrating the promise of mining 
spatial omics data to enhance deep learning for pathology workflows. 


Keywords: spatial omics, transcriptomics, deep learning, graphs, cancer, colon cancer. 


1. Introduction 


1.1. Deep Learning for Pathology 


In recent years, numerous studies have demonstrated the potential for deep learning algorithms 
to solve challenging biomedical tasks, thereby improving risk stratification and alleviating the 
potential for clinical burnout by making tedious and unreliable tasks faster and more quantitative, 
potentially leading to improved patient health outcomes 1. These algorithms are formulated on 
computational heuristics (specifically, machine learning) which can make sense of many complex 
data types through the dynamic derivation of relevant patterns and features 2-4. Analysis of 
pathology data, including whole slide imaging (WSI; microscopic images of patient tissue), is an 
emerging application in this space, as WSIs are routinely collected and used for patient monitoring, 
diagnosis, and prognostication. Existing works have shown that specially designed deep learning 
algorithms, inspired by processes of the central nervous system, may be able to automate or assist 
in these tasks 5. Most deep neural networks study small micromorphological changes given the 
enormity of these gigapixel images. Graph convolutional networks (GCNs), however, are a 
promising method in this domain, as they can effectively model macro- and micro-architectural 
features present across WSI in a human-interpretable manner 6. Generally, these methods split WSI 
into patches (i.e., more manageable subimages), extract numeric representations (i.e., 
“embeddings”) from each patch using a predetermined algorithm, and construct a graph where the 
nodes are given patch embeddings and edges are formed based on spatial adjacency 7°. Such 
methods have been applied to tumor stage prediction ° and survival analysis 8, and to derive numerical 
representations of WSI that can be combined with other omics and imaging modalities ’. 

The optimal algorithm for extracting node features is an area of ongoing research, though many 
works presently use a ResNet convolutional neural network (CNN) pretrained on the ImageNet 
database 10 for this task *!'-!*. It has become increasingly common to additionally train these CNNs 
on image tasks orthogonal to the task at hand, to prepopulate a registry of 
features which will ultimately improve predictive performance in other settings; these techniques 
are known as pretraining. Recently, self-supervised techniques have emerged as promising 
pretraining methodologies, where images are compared from several different vantage points 
without being explicitly labeled. Cross-modal pretraining has recently been highlighted as a 
self-supervised method that leverages complementary “paired” information across 
multiple input data types (e.g., images and text), which can improve the representation of all involved 
modalities. Here, we investigate the utility of using spatial omics data, which is paired at 50-micron 
resolution to the histological information, to pretrain an encoder model for these patches and thereby 
demonstrate the power of leveraging spatial omics for deep learning-based pathology methods 
which are particularly suited for analysis using graph neural networks (GNNs). 


1.2. Spatial Omics 


Omics data — such as gene expression quantification and DNA methylation — have traditionally been 
collected on a bulk scale where measurements are taken across an entire sample or tissue section. 
Recent advancements in technology have allowed for collection on a more granular scale, such as 
the single cell level, or across specific spots/regions in a slide sample '°. Prior studies have 
demonstrated that deep learning through specialized architectures like GCNs can mine spatial omics 
data to build a more comprehensive understanding of spatial cellular heterogeneity, especially as it 
pertains to how the tumor microenvironment can facilitate/inhibit further disease progression !41, 
Notably, this type of data is not yet commonly available at large scale due to the prohibitive cost of 
these assays as well as batch effects and selection of limited slide area, meaning that methods which 
can learn from spatial omics data and effectively transfer this knowledge to improve other tasks may 
be valuable. Zeng et al. 15 previously developed a model which utilized contrastive learning to mine 
a shared representation between image patches and corresponding spatial transcriptomics; however, 
their investigation centered on improving the understanding of gene domains, rather than 
attempting to leverage the method to enhance downstream clinical outcome modeling in situations 
where only WSI, and no spatial transcriptomics (ST) data, is available. 


1.3. Contributions 


We hypothesize that additional biological information can be learnt from spatially resolved 
transcriptomics data that may prove relevant for enhancing prediction models across a range of 
histological analyses. Existing works applying GCNs for WSI analysis have not yet leveraged 
spatial omics data to enhance modeling across orthogonal tasks. In part, this is because the quality 
of histological slides for spatially co-registered omics data has been limited: the standard Visium 
spatial transcriptomics (ST) workflow featured manual staining and low-resolution imaging, so this 
information does not readily transfer to prediction models on higher resolution histological slides. 
Now, with the development of assays such as the CytAssist which permit the use of sophisticated 
laboratory processing (i.e., autostaining and 40X imaging prior to Visium profiling), the quality of 
slides has remarkably increased and allows for training image models that may more readily transfer 
to related domains. Here, we assess the ability of spatially resolved omics data to enhance 
predictions on a range of different histological assessment tasks by presenting an initial evaluation 
of a crossmodal pretraining mechanism using matched WSI and spatial omics measurements as 
means to encode biological information within WSI graphs to apply in scenarios where spatial omics 
data is not available. We compare this method against other common pretraining schemes on 
downstream predictive analyses (staging, lymph node metastasis, survival prognostication) of WSI, 
as well as explore generated image patch embeddings. Accurate methods for these downstream 
predictive tasks may enable more personalized patient treatments. In this study, we expect 
models that can mine spatial molecular information to outperform the compared approaches 
on these tasks. We aim to demonstrate the potential benefits of utilizing spatial omics methods, 
spatial transcriptomics in particular, to enhance deep learning-driven pathology analysis. 


2. Methods 


2.1. Data Collection and Preprocessing 


Visium spatial transcriptomics data matched with WSI was collected from four colorectal cancer 
patients from the Dartmouth Hitchcock Medical Center, to serve as a training dataset for the 
crossmodal patch embedding method. This process was conducted through the 10x Genomics 
Visium spatial transcriptomics workflow, featuring H&E staining, followed by mRNA profiling and 
whole slide imaging. Spatial transcriptomics data were filtered to include the top 1000 most variable 
genes across slides identified by SpatialDE 16. Separately, 708 WSIs were collected from colorectal 
cancer patients from the Dartmouth Hitchcock Medical Center for whom histological stage 
annotations were available. Finally, WSIs were obtained for a cohort of 350 colorectal cancer 
patients from The Cancer Genome Atlas (TCGA) for whom survival information and lymph node 
metastasis information were available. All WSIs were stain normalized using the Macenko method 17. 
Collected WSIs were split into non-overlapping 224 x 224 patches via the PathflowAI Python 
package 18, and the patch embeddings served as node attributes in a graph. We compared several 
methods, described below, to encode information for these patches, which is the main focus of this study. 
Nodes were connected with edges based on spatial adjacency using the knn_graph (k-nearest 
neighbor) method from the torch_cluster Python package, with k=16. Patients from the in-house 
dataset and TCGA were separately partitioned into training, validation, and testing sets using a 
random 80/10/10 split. The collected datasets, and the downstream tasks they were used for, are 
summarized below: 

   1. Visium spatial transcriptomics slides (n=4; 20,000 spots/patches; co-registered 
      spatial transcriptomics and H&E WSI): used to pretrain the contrastive crossmodal model 
   2. Dartmouth Hitchcock Medical Center (n=708 H&E WSI): used for histological stage 
      prediction and clustering analysis 
   3. TCGA cohort (n=350 H&E WSI): used for lymph node metastasis prediction, survival 
      prognostication, and tumor infiltrating lymphocyte (TIL) alignment analysis 

All analyses were conducted on a machine with a single NVIDIA Tesla V100 GPU with 32 
gigabytes of VRAM and 100 gigabytes of RAM. 
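
The tiling and graph construction steps described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the PathflowAI or torch_cluster implementation: the function names, the slide dimensions, and the edge direction (neighbor to node) are hypothetical, and background filtering of patches is omitted.

```python
import math

def patch_origins(width, height, patch=224):
    """Top-left corners of the non-overlapping patch x patch tiles that fit in the slide."""
    return [(x, y)
            for y in range(0, height - patch + 1, patch)
            for x in range(0, width - patch + 1, patch)]

def knn_edges(coords, k=16):
    """Directed edges (neighbor, node) linking every node to its k nearest nodes."""
    edges = []
    for i, (xi, yi) in enumerate(coords):
        # Rank all other nodes by Euclidean distance to node i (brute force).
        others = sorted(
            (j for j in range(len(coords)) if j != i),
            key=lambda j: math.hypot(coords[j][0] - xi, coords[j][1] - yi),
        )
        edges.extend((j, i) for j in others[:k])
    return edges

coords = patch_origins(width=2240, height=1120)  # hypothetical slide: 10 x 5 grid of patches
edges = knn_edges(coords, k=16)                  # k=16 as stated in the text
```

Because patches lie on a regular grid, many neighbors are equidistant, so the exact neighbor set depends on how ties are broken; a library implementation may resolve ties differently from this sketch.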


2.2. Patch Level Pretraining Methods 
Three embedding production methods were compared for the 224x224 patches used as nodes of the 
graphs representing WSI. 


2.2.1. ImageNet-Pretrained ResNet18 


A ResNet18 CNN model pretrained on the ImageNet dataset (commonly used for embedding 
histopathology patches) was accessed using the torchvision Python package 
(https://github.com/pytorch/vision). The model was truncated at the penultimate layer to 
extract length-512 vectors/embeddings for each input patch. 


2.2.2. Ciga Self Supervised Histopathology Pretrained ResNet18 


A separate ResNet18 CNN model pretrained using a self-supervised learning (SSL) SimCLR 19 
contrastive procedure on histopathological imaging datasets was similarly accessed and truncated 
at the penultimate layer to extract length-512 embeddings for all patches. In summary, 
SimCLR employs an objective function that encourages similarity between embeddings from 
augmented (i.e., “corrupted”) views of the same image, while penalizing similarity 
between views from different images. This model was made publicly available by Ciga et al. 20, and 
has been previously shown to outperform the aforementioned ImageNet-pretrained model on a 
variety of downstream modeling tasks. 
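
To make the SimCLR objective summarized above concrete, the following is a minimal sketch of the normalized temperature-scaled cross-entropy (NT-Xent) loss in plain Python. It is not the code used by Ciga et al. or in this study; the function names and the temperature of 0.5 are assumptions chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nt_xent(views_a, views_b, temperature=0.5):
    """NT-Xent loss; views_a[i] and views_b[i] are two views of the same image."""
    z = list(views_a) + list(views_b)
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same image
        logits = [cosine(z[i], z[j]) / temperature for j in range(2 * n) if j != i]
        pos_logit = cosine(z[i], z[pos]) / temperature
        # Negative log-softmax of the positive pair among all other embeddings.
        total += math.log(sum(math.exp(l) for l in logits)) - pos_logit
    return total / (2 * n)

views_a = [[1.0, 0.0], [0.0, 1.0]]   # toy embeddings of two images
views_b = [[0.9, 0.1], [0.1, 0.9]]   # second view of the same two images
matched = nt_xent(views_a, views_b)
mismatched = nt_xent(views_a, views_b[::-1])  # views paired with the wrong image
```

In this toy example the loss is lower when each view is paired with the other view of its own image than when the pairing is swapped, which is the behavior the objective is designed to reward.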


2.2.3. Spatial Omics-driven Crossmodal Pretrained Encoder 


A contrastive crossmodal model encoding image patches and spatial transcriptomic profiles was 
created, similar to the model implemented by Zeng et al. 21. Input image patches of size 224x224 
were encoded into embeddings of size 512 using the feature extraction portion of a CNN 
initialized with weights from the ResNet model trained by Ciga et al. Spatial 
transcriptomics profiles containing expression of the 1000 most spatially variable genes across 
Visium slides, selected to avoid overfitting on genes with imprecise expression, were encoded with 
three standard fully connected (FC) layers of size 512. The embeddings from co-registered patches 
from each modality (ST, WSI) were passed through a common projection layer of size 512, to output 
a single embedding per modality (i.e., one vector of length 512 which describes an image patch, and 
one of length 512 which describes the corresponding gene expression). Crossmodal and unimodal 
contrastive penalties were applied using the SimCLR loss function 19. During training, several 
augmentation strategies were applied to both the image patches and the corresponding transcriptomic 
profiles to generate “corrupted” representations of each data type as a means for comparison. 
Transcriptomic profiles were randomly masked and corrupted with noise with 30% probability. 
Images were augmented using a series of random flips, color jitter transforms, random grayscaling, 
random rotation, and random image solarization. Both the original and augmented image patches 
and transcriptomics profiles were encoded using the aforementioned neural network layers. The 
unimodal losses penalize the model based on the difference between the embeddings from the original 
and augmented data within each modality. A crossmodal loss is used to maximize the similarity 
between the corrupted image and transcriptomic embeddings from the same patch. These three loss 
functions (augmented image to image, augmented transcriptomics to transcriptomics, augmented 
image to augmented transcriptomics) were summed to optimize the crossmodal contrastive model. 
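
The transcriptomic augmentation step can be illustrated with a short sketch. The text does not specify whether the 30% probability applies per gene or per profile, nor the noise distribution, so the per-gene application, the Gaussian noise, its standard deviation, and the function name below are all assumptions.

```python
import random

def corrupt_profile(expr, p=0.3, noise_sd=0.1, seed=None):
    """Return a corrupted copy of an expression profile (one value per gene)."""
    rng = random.Random(seed)
    corrupted = []
    for value in expr:
        if rng.random() < p:      # assumed: add Gaussian noise to this gene
            value += rng.gauss(0.0, noise_sd)
        if rng.random() < p:      # assumed: mask (zero out) this gene
            value = 0.0
        corrupted.append(value)
    return corrupted

profile = [float(g % 7) for g in range(1000)]   # hypothetical 1000-gene profile
view_1 = corrupt_profile(profile, seed=0)       # two independently corrupted views
view_2 = corrupt_profile(profile, seed=1)
```

Each corrupted view would then be passed through the transcriptomics encoder, so that the unimodal and crossmodal contrastive terms compare embeddings of differently corrupted versions of the same spot.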

This model was trained for 150 epochs with a batch size of 8 and a learning rate of 0.00001. 
Visium sections corresponding to six patients were partitioned into the training set, and tissue 
sections from two patients were partitioned into the validation set. Validation set loss was used to 
select the top model following training. The ReLU activation function was applied to 
the outputs of every layer. The image encoder pretrained using the spatially co-registered 
transcriptomics information and the subsequent projection head were retained for subsequent 
analysis, and were used to embed the image patches on which GNN models operate. The 
remaining layers of this pretrained model were not utilized. The use of this image encoder 
for other ancillary tasks is the primary focus of this study, compared to 
the other image encoders (weights from ImageNet, Ciga et al.). This model is further described, 
along with the data collection procedure, in Figure 1. 

Figure 1: A) Data collection protocol for Visium spatial transcriptomics slide. B) Training protocol for spatial 
omics-driven crossmodal contrastive model; two views are generated per modality, per patch; each view is passed through 
the corresponding branch of the crossmodal model; embeddings are transformed using a shared projection head; 
unimodal and crossmodal contrastive losses are applied to output embeddings. 


2.3. Downstream Outcome Prediction 


We sought to understand whether CNN encoders pretrained on co-registered spatial transcriptomics 
data could enhance predictions on a range of different GCN tasks. A graph convolutional 
network was constructed to take an input graph of nodes represented by length-512 embeddings, 
followed by three GCNConv graph convolutional layers 22 to contextualize and aggregate 
embeddings into length 128, with SAGEPooling pooling layers 23 (i.e., 30% of patches retained for 
subsequent layers; SAGEPooling stochastically samples higher-order neighborhoods of patches) 
placed after each convolutional layer. These pooling layers learn to downsample graphs, pushing the 
model to learn focused information relevant to the training task. Graph embeddings were aggregated 
using global mean pooling after each SAGEPooling layer. These embeddings were combined using 
the JumpingKnowledge mechanism, resulting in a single vector of length 128 to represent the entire 
input graph/WSI. Finally, two fully connected layers were applied to this embedding, followed by 
a single output layer. The model (Figure 2) was applied to the following prognostication-focused 
experiments/outputs to assess patch encoding mechanisms: 


2.3.1. Histological Stage Prediction 


The in-house dataset was used to train and assess model capability to predict dichotomized tumor 
histological stage (T-stage; signifies depth of invasion): low (stage 0, stage 1, stage 2) or high (stage 
3, stage 4). A sigmoid function was applied to the output of the final layer in the GCN, and model 
training was supervised using a binary cross-entropy loss function. 


2.3.2. Survival Prognostication 


The TCGA dataset was used to train and evaluate GCNs to estimate time to death using hazard 
predictions, which indicate the instantaneous risk of death. Model training was supervised using a 
standard Cox loss, which considers the predicted risk, patient censoring status, and duration (either 
days to death or days to last follow-up). This setup entails the proportional hazards assumption, namely 
that predictors have a constant hazard ratio (i.e., relative risk between two patient groups) over time. 
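
As an illustration of the Cox loss described above, the sketch below computes the negative Cox partial log-likelihood in plain Python. It assumes the Breslow form without special handling of tied event times and averages over observed events; the function name and toy data are hypothetical, and this is not necessarily the exact loss implementation used in the study.

```python
import math

def cox_loss(risks, durations, events):
    """Negative Cox partial log-likelihood, averaged over observed events.

    risks: predicted log-hazards; durations: days to death or to last follow-up;
    events: 1 if death was observed, 0 if the patient was censored.
    """
    loss, n_events = 0.0, 0
    for i in range(len(risks)):
        if not events[i]:
            continue  # censored patients contribute only through risk sets
        # Risk set: patients still under observation at patient i's event time.
        risk_set = [risks[j] for j in range(len(risks)) if durations[j] >= durations[i]]
        loss += math.log(sum(math.exp(r) for r in risk_set)) - risks[i]
        n_events += 1
    return loss / max(n_events, 1)

durations = [100, 200, 300]   # toy follow-up times in days
events = [1, 1, 0]            # the third patient is censored
concordant = cox_loss([2.0, 1.0, 0.0], durations, events)  # higher risk dies earlier
discordant = cox_loss([0.0, 1.0, 2.0], durations, events)  # higher risk survives longer
```

The loss is smaller when patients with higher predicted risk experience the event earlier, which is what training encourages.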

All GCN models were trained for up to 30 epochs, using a learning rate of 0.001 and a batch size of 
8. Top model checkpoints were selected for evaluation following training, based on validation set 
loss. GCN models were implemented using the PyTorch Geometric 24 Python package. Three 
separate GCN models were trained for each prediction task, one for each patch embedding 
mechanism. Stage prediction and lymph node metastasis models were evaluated on held-out test 
sets using F1-score and area under the curve (AUC), while C-index was used to evaluate 
prognostication models. These metrics are reported with 95% confidence intervals derived from 
1000-sample non-parametric bootstrapping procedures. 
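
A percentile bootstrap of a test-set metric can be sketched as follows. This is an assumption-laden illustration: the text does not state which bootstrap variant was used, and the function names, the fixed seed, and the toy labels are hypothetical. The F1-score is used here as the example metric.

```python
import random

def f1_score(y_true, y_pred):
    """F1-score for binary labels (1 = positive); returns 0.0 if undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def bootstrap_ci(metric, y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for metric(y_true, y_pred)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample cases with replacement
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int(n_boot * alpha / 2)]
    upper = stats[min(n_boot - 1, int(n_boot * (1 - alpha / 2)))]
    return lower, upper

y_true = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]   # toy test-set labels
y_pred = [1, 1, 0, 0, 1, 1, 0, 1, 1, 0]   # toy model predictions
ci_low, ci_high = bootstrap_ci(f1_score, y_true, y_pred)
```

The same resampling loop applies to AUC or C-index by swapping in a different metric function, provided that function handles resamples in which the metric is undefined.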




Figure 2: Overview of generalized GCN for downstream outcome modeling; initial patch embeddings vary across 
experimentation. A) Graph convolution layers contextualize each node embedding; after each such layer, SAGEPool 
operators aggregate nodes/patch embeddings, removing up to 70% of them, to only retain informative ones. B) A 
JumpingKnowledge scheme aggregates embeddings across graphs to create a single embedding for the image. C) The 
image embedding is used to make downstream predictions. 
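
To give a sense of how aggressively three successive pooling layers shrink a slide graph, the arithmetic can be sketched as below. The rounding rule (rounding up at each layer) and the starting node count are assumptions for illustration; the actual pooling operator may round differently.

```python
def nodes_after_pooling(n_nodes, keep_percent=30, n_layers=3):
    """Node counts before pooling and after each of n_layers pooling steps."""
    counts = [n_nodes]
    for _ in range(n_layers):
        # Integer ceiling of keep_percent% of the current node count.
        counts.append(-(-counts[-1] * keep_percent // 100))
    return counts

counts = nodes_after_pooling(1000)   # hypothetical slide graph with 1000 patches
print(counts)                        # [1000, 300, 90, 27]
```

Under these assumptions, fewer than 3% of the original patches remain after the third pooling layer, which is why the surviving patches can be read as those the model treats as most informative.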


2.4. Embedding Clustering Quality Analysis 


The ability of patch embeddings to capture morphological and molecular heterogeneity across slides 
was assessed across embedding methods, using an unsupervised clustering approach and the 
in-house dataset. For each WSI in the dataset, KMeans clustering (k=5; chosen via coarse optimization 
to ensure stability when run numerous times) was applied to the patch embeddings derived by each 
pretraining method (standard ResNet, Ciga et al., spatial pretrained) to elucidate sub-groups of 
patches implicitly captured by the representations. Clusters were plotted across slides to visually 
confirm that they represented different morphologies and structures within slides. Subsequently, the 
Calinski-Harabasz (CH) index 25 and the Davies-Bouldin (DB) index 26 were computed for the 
clustering result for each pretraining strategy. The ANOVA-based CH score assesses the density 
and separation of clusters, with a higher value indicating greater density within clusters and 
separation among different clusters. Similarly, the DB index measures the ratio of within-cluster 
scatter to between-cluster separation. Thus, superior patch embeddings should result in a relatively 
high CH index and low DB index. The per-WSI scores were used to calculate the average CH index 
and DB index, with a 95% confidence interval, for each pretraining method. 
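
For reference, both indices can be computed directly from their standard definitions. The sketch below is a plain-Python illustration with hypothetical function names and toy 2-D points; it assumes at least two clusters and non-zero within-cluster scatter, and it is not the implementation used in the study.

```python
import math

def _centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]

def _dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def clustering_indices(points, labels):
    """Return (Calinski-Harabasz index, Davies-Bouldin index) for a clustering."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k, n = len(clusters), len(points)
    overall = _centroid(points)
    cents = {lab: _centroid(pts) for lab, pts in clusters.items()}

    # CH: between-cluster dispersion over within-cluster dispersion, each
    # scaled by its degrees of freedom.
    between = sum(len(pts) * _dist(cents[lab], overall) ** 2 for lab, pts in clusters.items())
    within = sum(_dist(p, cents[lab]) ** 2 for lab, pts in clusters.items() for p in pts)
    ch = (between / (k - 1)) / (within / (n - k))

    # DB: for each cluster, the worst ratio of summed scatter to centroid distance.
    scatter = {lab: sum(_dist(p, cents[lab]) for p in pts) / len(pts)
               for lab, pts in clusters.items()}
    db = sum(
        max((scatter[a] + scatter[b]) / _dist(cents[a], cents[b]) for b in clusters if b != a)
        for a in clusters
    ) / k
    return ch, db

labels = [0, 0, 1, 1]
far_apart = clustering_indices([(0, 0), (0, 1), (10, 0), (10, 1)], labels)
close_by = clustering_indices([(0, 0), (0, 1), (2, 0), (2, 1)], labels)
```

In the toy data, moving the two clusters further apart raises the CH index and lowers the DB index, matching the interpretation given above.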


2.5. TIL-based Model Interpretation 


Previous research has demonstrated the importance of tumor infiltrating lymphocytes (TILs) and 
the tumor microenvironment in the progression of colon cancer 27. We sought to demonstrate the 
interpretability of GCN models developed here using the TCGA dataset, by comparing regions of 
WSI given high attention with previously published predicted TIL maps 28 for corresponding slides. 
Patches deemed important by GCN models trained on lymph node metastasis prediction were 
determined by extracting patches remaining in WSI graphs following the final pooling layer; for a 
given patch, remaining in its graph after three pooling layers indicates its 
significance to the model. The coordinates of these patches were compared to those describing the 
locations of predicted TILs via Wald Wolfowitz testing 29, where the null hypothesis would indicate 
high overlap between these two sets of coordinates. Accordingly, Wald Wolfowitz testing was used 
to calculate a test statistic per slide per GCN model trained with each patch embedding method; 
negative values of this test statistic, W, represent the localization of TILs. Spearman’s rank 
correlation coefficients (significance level α = 0.05) were calculated to evaluate the relationship 
between the test statistic (W) and predicted hazard. A significant negative correlation coefficient 
would suggest an association between predicted hazard and TIL spatial localization, following 
biological knowledge holding that TILs help inhibit colon cancer proliferation and migration 30. Test 
statistics were further dichotomized to indicate presence/lack of TIL localization, to compare these 
relationships across the GCN model using embeddings derived from the Ciga et al. method versus 
the model using spatially pretrained embeddings. 
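
The rank correlation step can be sketched in plain Python as the Pearson correlation of average ranks. The function names and the per-slide values below are hypothetical and serve only to show the computation; p-value calculation is omitted.

```python
def _ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rank correlation coefficient between two equal-length lists."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

w_stat = [-2.0, -1.0, 0.5, 1.5]      # hypothetical per-slide test statistics (W)
pred_risk = [0.9, 0.7, 0.4, 0.1]     # hypothetical per-slide predicted risk
rho = spearman(w_stat, pred_risk)
```

Because the coefficient depends only on ranks, it captures any monotone relationship between W and predicted risk rather than a strictly linear one.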


3. Results† 


3.1. Quantitative Predictive Analysis 


Held-out test set performance for GCNs trained to predict stage, lymph node metastasis, and 
survival prognosis is presented in Table 1; models which used patch embeddings derived from 
the spatial omics-driven mechanism outperformed those using the compared methods for all three 
experiments. 


Table 1: Test set performance metrics (95% confidence interval) of GCNs trained using various patch embedding 
mechanisms, for binary stage prediction, lymph node metastasis prediction, and survival prognostication. 


Task                       Measure    ImageNet ResNet   Ciga et al. ResNet   Spatial Pretrained 
Stage Prediction           AUC        0.935 ± 0.003     0.948 ± 0.002        0.981 ± 0.001 
                           F1-Score   0.863 ± 0.004     0.858 ± 0.004        0.878 ± 0.004 
Lymph Node Metastasis      AUC        0.651 ± 0.004     0.612 ± 0.004        0.708 ± 0.003 
                           F1-Score   0.560 ± 0.002     0.630 ± 0.003        0.671 ± 0.005 
Survival Prognostication   C-index    0.597 ± 0.003     0.582 ± 0.002        0.638 ± 0.002 


† Supplementary materials can be found at the following DOI: https://doi.org/10.5281/zenodo.8197573. 

For the classification experiments, models using embeddings derived from the spatial 
omics-driven mechanism outperformed those which used embeddings from the ImageNet-trained 
ResNet18 CNN by an average of 6.98% measured by AUC, and outperformed models using 
embeddings derived from the ResNet18 pretrained by Ciga et al. by an average of 9.47%. For survival 
prognostication, GCNs using spatial omics-driven embeddings (C-index 0.638) also outperformed 
those using ImageNet-trained ResNet18 embeddings (C-index 0.597) and embeddings derived from 
the model trained by Ciga et al. (C-index 0.582). 


3.2. Clustering Evaluation 


A KMeans clustering approach paired with CH index and DB index calculation was employed to 
compare the abilities of these different pretraining approaches to elucidate molecular and 
morphological heterogeneity across slides; the results of this analysis are presented in Table 2. An 
example visualization, in which regions of a slide assigned to each cluster are indicated by different 
colors, is presented in Figure 3; additional examples are available in Supplementary Figures S2 
and S3. 

Embeddings from the contrastive crossmodal spatial model resulted in a significantly higher CH 
index and lower DB index, versus both the ImageNet-pretrained ResNet and the ResNet trained on 
histopathology datasets via self-supervised learning by Ciga et al. 

Figure 3: Example visualization of clustering of embeddings derived using various methods, for a single WSI. 


Table 2: Clustering quality metrics calculated across embedding methods. 


Measure                    ImageNet ResNet    Ciga et al. ResNet   Spatial Pretrained 
Calinski-Harabasz Index    643.76 ± 17.51     786.70 ± 20.40       2605.68 ± 70.66 
Davies-Bouldin Index       1.90 ± 0.01        1.719 ± 0.01         0.975 ± 0.01 


3.3. Model Interpretation 


Spearman’s correlation coefficients testing the relationship between lymph node metastasis 
risk predicted by GCN models using various patch embedding mechanisms and TIL localization 
elucidated via Wald Wolfowitz testing are presented in Table 3 along with corresponding p-values, 
suggesting both the Ciga and spatial pretrained models were able to derive TIL-associated 
embeddings related to instantaneous hazards. Boxplots comparing predicted model 
risk and dichotomized TIL alignment are presented in Supplementary Figure S4. 


Table 3: Spearman’s correlation coefficient values for TIL localization versus predicted lymph node metastasis risk, 
across GCN models using various patch embedding methods. 


                          ImageNet ResNet   Ciga et al. ResNet   Spatial Pretrained 
Spearman’s Coefficient    -0.061            -0.426               -0.218 
Spearman’s P-value        0.2693            2.2e-16              7.74e-5 


4. Discussion and Conclusion 


This is the first study which aims to determine whether leveraging spatial omics data to pretrain 
image patch encoders using a crossmodal contrastive mechanism can improve downstream 
performance in graph convolutional networks, which may improve automated cancer patient 
analysis. While most prior research leveraged a GCN to integrate spatially localized omics with 
imaging for spot-level spatial transcriptomics enhancement or histological feature extraction tied to 
bulk transcriptional characteristics, our approach discerns spatial transcriptomics features from 
standalone slides. Recognizing the inaccessibility of spatial transcriptomics data, we employed 
transfer learning to apply extracted spatial transcriptomics features to a diverse range of subsequent 
tasks. We compared spatial omics-driven embeddings against those extracted from a standard 
ResNet18 CNN pretrained on the ImageNet dataset, and a ResNet18 pretrained using 
self-supervised learning on histopathology datasets. GCN models trained and evaluated using the 
spatially enhanced embeddings outperformed those using the baseline embedding methods on three 
downstream tasks: stage prediction, lymph node metastasis prediction, and prognostication. This 
suggests that incorporating spatial transcriptomics information into the pretraining process of image 
patch encoders enhances the quality of learned representations beyond what is extracted from 
state-of-the-art techniques which use solely images for patch encoding pretraining. 

Additional quantitative analysis from clustering patch embeddings indicates that the models 
leveraging spatially-pretrained embeddings were superior at capturing distinct heterogeneities 
across slides, compared with models using patch embeddings from existing strategies. Thus, we expect 
future applications of the developed spatial pretraining method for patch embeddings to improve 
the performance of workflows aiming to capture tissue heterogeneity, including tumor 
subcompartment segmentation. 

Furthermore, Wald-Wolfowitz testing paired with Spearman’s correlation coefficients suggests 
that GCN models using embeddings from the spatial pretraining method and the Ciga et al. method 
learned to highlight TILs to contextualize prognostic assessment of cancerous tissue when 
considering lymph node metastatic potential, particularly in patients whom the models understood 
to be at lower risk. The Spearman’s coefficient for the GCN model using ImageNet ResNet 
patch representations was markedly closer to 0 than those of the other two methods, indicating a far 
weaker correlation in this relationship. Interestingly, the magnitude of the coefficient for the GCN model 
using the Ciga et al. embeddings was nearly double that of the spatially pretrained embeddings 
(0.426 versus 0.218), indicating that the Ciga et al. method may induce a greater tendency to turn to 
TILs for understanding patient profiles. Though this does not indicate greater predictive power among 
models, the fact that such nuances in model reasoning can be extracted demonstrates the interpretability 
of graph-based modeling for cancer histopathology, and further emphasizes the importance of 
enhancing the ability of such methods. 

Overall, our results indicate that spatial omics data can be effectively mined in a crossmodal 
fashion to improve existing image-based deep learning workflows for analyzing cancer 
histopathology; this also adds to the growing body of literature which reflects the importance 
of enhancing pretraining mechanisms as a basis for improving deep learning models for cancer 
histopathology. Notably, ours is the first study to mine spatial omics data in the pretraining process 
to enhance the capability of such image-based models, while others have focused on mechanisms 
which use solely imaging. Several AI methods also exist to integrate spatial transcriptomics with 
histology through contrastive learning to improve the identification of spatial domains. This work 
differs from prior approaches as it aims to improve the extraction of imaging information on held- 
out tissue slides from which Visium spatial transcriptomics assaying has not been done, training 
with paired imaging and spatial expression data to enhance this capability. 

A key limitation of this study is the relatively small dataset used to pretrain the spatially-enhanced 
crossmodal contrastive model; spatial transcriptomics data was only generated for 4 total 
slides due to high resource and time costs and the limited size of the tissue placement area on Visium 
slides. Furthermore, a coarse hyperparameter search was used to select GCN architecture parameters, 
as a detailed experiment here was beyond the scope of this study. It should be noted that optimization 
of the convolutional neural network and GCN parameters can be done end-to-end, i.e., 
simultaneously, which can improve predictive results, as will incorporating additional varied 
histologies and tumor characteristics, improved specimen processing/imaging using the CytAssist, 
and commensurate hardware to fit larger models. Future work will seek to use larger cohorts to 
pretrain the spatial model to improve the quality of extracted embeddings. Additionally, the embeddings 
from the spatially enhanced model can be evaluated for use in applications other than GCNs, such 
as Transformer networks (which have become popular in cancer histopathology in recent years), 
histology image search, and multimodal data integration. 


The advent of spatial transcriptomics technologies has heralded a renaissance in research to advance 
our understanding of the spatial cellular and transcriptional heterogeneity within tissues. Spatial 
transcriptomics allows investigation of the interplay between cells, molecular pathways, and the 
surrounding tissue architecture and can help elucidate developmental trajectories, disease 
pathogenesis, and various niches in the tumor microenvironment. Photoaging is the histological and 
molecular skin damage resulting from chronic/acute sun exposure and is a major risk factor for skin 
cancer. Spatial transcriptomics technologies hold promise for improving the reliability of evaluating 
photoaging and developing new therapeutics. Challenges to current methods include limited focus 
on dermal elastosis variations and reliance on self-reported measures, which can introduce 
subjectivity and inconsistency. Spatial transcriptomics offers an opportunity to assess photoaging 
objectively and reproducibly in studies of carcinogenesis and to discern the effectiveness of therapies 
that intervene in photoaging and prevent cancer. Evaluation of distinct histological architectures 
using highly-multiplexed spatial technologies can identify specific cell lineages that have been 
understudied due to their location beyond the depth of UV penetration. However, the cost and 
inter-patient variability of state-of-the-art assays such as the 10x Genomics Spatial Transcriptomics 
assays limit the scope and scale of large-scale molecular epidemiologic studies. Here, we investigate 
the inference of spatial transcriptomics information from routine hematoxylin and eosin-stained 
(H&E) tissue slides. We employed the Visium CytAssist spatial transcriptomics assay to analyze 
over 18,000 genes at a 50-micron resolution for four patients from a cohort of 261 skin specimens 
collected adjacent to surgical resection sites for basal cell and squamous cell keratinocyte tumors. 
The spatial transcriptomics data was co-registered with 40x resolution whole slide imaging (WSI) 
information. We developed machine learning models that achieved a macro-averaged median AUC 
and F1 score of 0.80 and 0.61 and Spearman coefficient of 0.60 in inferring transcriptomic profiles 
across the slides, and accurately captured biological pathways across various tissue architectures. 


Keywords: Deep Learning, Machine Learning, Spatial Transcriptomics, Skin Photoaging. 


1. Introduction 


Spatial transcriptomics is an innovative and rapidly evolving field in biomedical research that 
combines the power of genomics and spatial mapping techniques to gain insights into the spatial 
organization of gene expression within complex tissues, such as the skin. By providing a detailed 
view of gene expression patterns in relation to cellular and tissue architecture, spatial 
transcriptomics has quickly become a valuable tool for biomedical research, including 
dermatological research. 

The skin is the largest organ in the body, composed of multiple cell types that each play a crucial 
role in maintaining its structure and function. Though traditional genomic analysis techniques, such 
as bulk-RNA sequencing (RNA-seq), and disaggregated techniques, such as single cell RNA 
sequencing (scRNA-seq), have provided valuable information about cellular heterogeneity and 
disease progression, they lack the ability to assess localized gene expression patterns that may relate 
to cell-cell interactions and architecture to support tissue function. Spatial transcriptomics 
approaches uniquely allow researchers to examine gene expression patterns within their anatomical 
and histological context, enabling a deeper understanding of the underlying molecular mechanisms 
driving skin biology, carcinogenesis, and disease progression. 

An important potential application of spatial transcriptomics in dermatology is to advance the 
emerging study of skin aging.' The skin serves as a barrier between the environment and the body 
where it is exposed to near-constant insults, including ultraviolet radiation (UVR), mechanical 
stress, and toxicants.2 These exposures, along with genetic influences, combine to induce skin 
damage, reduced function, and, ultimately, a characteristic loss of elasticity of the skin largely 
reflecting degradation of the collagen matrix. More recently, Zou et al. created a single-cell 
transcriptomic atlas of human skin aging using eyelid tissue and identified cell-type-specific 
associations with human skin aging.' Further characterization of cellular changes that incorporate 
spatial information in skin can inform therapeutic strategies and interventions to combat age-related 
skin alterations and disease. 

Currently, spatial transcriptomics technologies at whole transcriptomic-level multiplexing are 
incredibly costly and prone to several sources of variation (e.g., within/between-subject variation), 
limiting broad application. Recently, deep learning models have been proposed as a cost-saving 
alternative to predict spatial gene expression from routine tissue stains. For instance, the 
DeepSpaCE approach includes convolutional neural networks (CNNs) for spatial gene cluster and 
gene expression prediction in human breast cancer tissue sections.t Another modeling paradigm 
aimed at predicting spatial gene expression across breast and cutaneous tumor data used a mix of 
transformer and graph neural network-based approaches.° In addition, the performance of several 
different modeling approaches for spatial gene expression prediction in tissue was recently 
compared using stage-III (pT3) colorectal tumors. Though these studies demonstrate the potential 
to infer spatial expression patterns using histomorphological data, several crucial questions 
remain unanswered, including the applicability of these methods to non-cancerous tissue sections, 
to other biological domains (e.g., dermatology), as well as the extent to which prediction modeling 
can preserve salient biological pathways and relationships required for downstream analysis on 
larger cohorts. 

In this pilot study, we develop and validate a deep learning method for the prediction of spatial 
gene expression across spatially variable genes in routine H&E-stained skin tissue. Predictions can 
be used to create synthetic multidimensional tissue maps—similar to those produced through spatial 
transcriptomic profiling—for tissues without corresponding spatial transcriptomics data. Use of 
deep learning models promises to reduce the cost and time associated with spatial transcriptomics 
data acquisition for dermatological applications, greatly expanding access to the technology and its 
range of unique insights. Interrogating pathways associated with spatially inferred genes can 
advance our knowledge of skin biology, improve diagnostic tools, and pave the way for more 
personalized treatment strategies. 


2. Methods and Materials 


In this work, we attempt to predict the spatial gene expression of Visium spatial transcriptomics 
spots distributed across 40X magnification H&E slides. To this end, we use the following methods: 
1. Data collection and annotation: Acquired H&E whole slide images (WSI), and spatially 
registered Visium CytAssist assayed spatial transcriptomics slides from 4 human cheek skin 
tissue samples collected from sites histologically adjacent to basal cell carcinoma (BCC) and 
squamous cell carcinoma (SCC) during skin cancer removal surgery. These samples were 
then graded by dermatologists for their solar elastosis status (two mild, two severe). 
Additionally, dermatologists annotated regions corresponding to distinct histological entities 
(e.g., epidermis, eccrine glands, hair follicles, sebaceous glands, and vascular/endothelial 
infrastructure). Instances of actinic keratoses were documented from a larger cohort. 

2. Preprocessing: Preprocess gene expression and WSI subarrays to capture spatially variable 
genes and genetically dense regions of tissue. 

3. Model development: Configure the SWIN-T transformer to perform two distinct modeling 
tasks, binary (dichotomized expression) and continuous gene expression prediction, on the 
1000 most spatially variable genes. 

4. Leave-one-patient-out cross-validation: Evaluation on held-out slides/patients as a 
measure of external applicability. 

5. Recover spatial biology inferences: Model performance was further measured using: 1) 
pathway analysis for high performing genes, 2) topological consistency between ground 
truth and predicted expression, and 3) the ability to recapitulate genes and pathways 
associated with distinct histological structures. 

Each of these steps will be further detailed in the ensuing sections. 


2.1. Data Collection 


Figure 1: Cohort Description. 261 WSIs were scanned. Four of these slides underwent further spatial 
transcriptomics profiling and were annotated for distinct histological architectures. Two are shown here. 
From left to right, the WSI, histological annotations, and Visium spatial transcriptomics spot array for (A) 
sample 14 and (B) sample 167. 

Four specimens were collected for profiling from a cohort of 261 tissue samples obtained in a 
single-site Mohs micrographic surgery (MMS) clinic between March 1st, 2022 and October 10th, 2022. The 
samples were mostly from the head and neck, and all were from sites histologically adjacent to either 
basal cell carcinoma or squamous cell carcinoma, as confirmed by histologic analysis of frozen 
section slides. The tissue was removed as part of standard surgical practice as Burow’s triangle flaps 
for skin grafting/reconstruction. These triangles are normally discarded; two triangles were collected per 
patient, in some cases bisected. One triangle underwent formalin fixation while the other triangle 
was frozen. Formalin-fixed specimens were breadloafed, embedded in paraffin 
blocks, and sectioned and stained with hematoxylin and eosin (H&E) using Autostainers for 
subsequent imaging at 40x resolution (0.25 micron/pixel) using Aperio GT450 scanners. Tissue 
slides were transported to the Genomics core, where, after tissue decoverslipping, the Visium 
CytAssist device was used to transfer transcriptomic probes from the original glass slides to 
11 mm x 11 mm capture areas on Visium slides. Sections from two patients were placed into each 
capture area to conserve costs and separated during the analysis stage. Whole transcriptomic 
profiling was accomplished after mRNA permeabilization, poly(A) capture, and probe 
hybridization. The eosin-stained tissue sections were imaged using the CytAssist, and these images were 
then co-registered to the original 40X whole slide images (WSI). Given the limited sample size due to 
the spatial transcriptomics assay costs, four specimens were selected, representing cheek tissue from 
four females: two with mild elastosis (participants #178 and #14, ages 24 and 76, respectively) and two 
with severe elastosis (participants #167 and #107, ages 55 and 84, respectively). 


2.2. Preprocessing 


Prior to processing, Visium spatial transcriptomics profiles for samples contained 18,085 genes 
measured across several thousand locations throughout each slide. Each profile was then subjected 
to preliminary filtering, where genes and spots were filtered according to their abundance (i.e., spots 
with fewer than 500 genes, genes expressed in fewer than 3 spots, and spots with more than 15% 
mitochondrial gene expression were filtered out). After filtering out the regions lacking tissue using 
a custom annotation tool augmented by the SAM, the total number of Visium spots per slide was 
2561, 3279, 3547, and 1737, each sampled in a honeycomb formation. Each Visium spot covers a 
circular capture area with a diameter of 50 microns (~200 pixels at 40x magnification). After 
sequencing, we used the SpaceRanger package to preprocess the Visium reads into gene count 
matrices. 
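
The abundance thresholds above can be illustrated with a small pure-Python sketch. It is not the pipeline used in this study: the nested-dictionary data layout, the function name, the order of the steps (spots first, then genes), and the use of the `MT-` gene-symbol prefix to identify mitochondrial genes are all assumptions made for illustration.

```python
def filter_spots_and_genes(counts, min_genes=500, min_spots=3, max_mito_frac=0.15):
    """Filter a {spot_id: {gene: count}} mapping by abundance.

    Drops spots with fewer than `min_genes` detected genes or with more than
    `max_mito_frac` of their counts in mitochondrial genes, then drops genes
    detected in fewer than `min_spots` of the remaining spots.
    """
    kept_spots = {}
    for spot, genes in counts.items():
        total = sum(genes.values())
        detected = sum(1 for c in genes.values() if c > 0)
        mito = sum(c for g, c in genes.items() if g.startswith("MT-"))  # assumed naming
        if total == 0 or detected < min_genes:
            continue
        if mito / total > max_mito_frac:
            continue
        kept_spots[spot] = genes

    # Count, for every gene, the number of retained spots in which it is detected.
    spots_per_gene = {}
    for genes in kept_spots.values():
        for gene, c in genes.items():
            if c > 0:
                spots_per_gene[gene] = spots_per_gene.get(gene, 0) + 1
    kept_genes = {g for g, n in spots_per_gene.items() if n >= min_spots}

    return {
        spot: {g: c for g, c in genes.items() if g in kept_genes}
        for spot, genes in kept_spots.items()
    }


# Toy example with deliberately small thresholds so the effect is visible.
toy_counts = {
    "s1": {"A": 5, "B": 3, "MT-CO1": 1},  # kept
    "s2": {"A": 2, "B": 0, "MT-CO1": 0},  # too few detected genes
    "s3": {"A": 1, "B": 1, "MT-CO1": 8},  # mitochondrial fraction too high
    "s4": {"A": 4, "C": 2, "MT-CO1": 0},  # kept
}
filtered = filter_spots_and_genes(toy_counts, min_genes=2, min_spots=2)
```

With the default arguments the function applies the thresholds stated in the text (500 genes, 3 spots, 15% mitochondrial expression).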

Every whole slide image (WSI) used for the Visium assay captures an area (11 x 5.5 mm, 
i.e., half of the capture area per patient) that spans tens of thousands of pixels along 
each dimension. Accordingly, to make the prediction task computationally tractable, we subdivided 
every WSI into square 512 by 512-pixel image patches (i.e., subarrays) centered on each Visium 
spot. The gene expression of the central 50-micron Visium spot was aligned to each image patch. 
Data present within the image patch but falling outside the capture area of the Visium spot were 
considered to have less direct relevance to the cells being assayed. Spots were additionally annotated 
based on the aforementioned tissue histological structures using the Annotorious OpenSeadragon 
plugin. 
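
The patch geometry follows from simple arithmetic: at 0.25 micron/pixel, a 50-micron spot spans about 200 pixels, so the spot occupies only a small central portion of its 512 by 512-pixel patch. The sketch below works through this; the helper names and the edge-clamping behavior are assumptions for illustration and are not taken from the study's code.

```python
import math


def microns_to_pixels(microns, microns_per_pixel=0.25):
    """Convert a physical length to pixels at the given scanning resolution."""
    return round(microns / microns_per_pixel)


def patch_bounds(cx, cy, size=512, width=None, height=None):
    """(left, top, right, bottom) of a square patch centered on pixel (cx, cy).

    If the slide dimensions are given, the window is shifted to stay inside
    the image (an assumed policy for spots near the slide edge).
    """
    half = size // 2
    left, top = cx - half, cy - half
    if width is not None:
        left = max(0, min(left, width - size))
    if height is not None:
        top = max(0, min(top, height - size))
    return left, top, left + size, top + size


def spot_area_fraction(spot_diameter_px, size=512):
    """Fraction of the patch area covered by the circular Visium spot."""
    return math.pi * (spot_diameter_px / 2) ** 2 / size ** 2


spot_px = microns_to_pixels(50)         # 50-micron spot at 0.25 micron/pixel
bounds = patch_bounds(1000, 2000)       # hypothetical spot center
coverage = spot_area_fraction(spot_px)  # roughly 12% of the patch
```

The remaining area of each patch supplies the neighborhood context that surrounds the assayed spot.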


2.3. Model Development 


2.3.1. Inference Targets 


As predicting all of the genes assayed is computationally intractable, we used the SpatialDE library 
to select the top 1000 genes based on their mean spatial variance (MSV) across all slides (i.e., 
selected genes that exhibited the greatest spatial variation across the 4 slides). We then tested the 
capacity of our models to predict both dichotomized and log gene expression for all 1000 genes. 

More specifically, in the dichotomized prediction task, patches were classified as having 
“high” or “low” gene expression for each gene if the expression of the gene at that patch location 
was greater or lower than its mean gene expression across all other Visium spots in the 
corresponding WSI. This approach follows existing work detailed in Fatemi et al.° For this task, 
models were trained using a binary cross entropy loss function. 
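
A minimal sketch of this labeling and loss is given below for one gene on one slide. It is an illustration rather than the study's implementation: the function names are hypothetical, and for simplicity the threshold is the mean over all spots on the slide (including the spot being labeled), which is an assumption.

```python
import math


def dichotomize(expression):
    """Label each spot 1 ("high") if above the slide-wide mean, else 0 ("low")."""
    mean = sum(expression) / len(expression)
    return [1 if value > mean else 0 for value in expression]


def binary_cross_entropy(labels, probs, eps=1e-7):
    """Mean binary cross entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


# Hypothetical expression of one gene across four spots of a slide.
toy_expression = [0, 0, 4, 8]
toy_labels = dichotomize(toy_expression)                 # mean is 3
toy_loss = binary_cross_entropy(toy_labels, [0.1, 0.2, 0.7, 0.9])
```

In a multi-gene model, the same loss would be averaged over all predicted genes for each patch.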

In the continuous expression task, by contrast, models were trained to predict the log- 
pseudocount log(1+counts) gene expression for each gene within the corresponding image patch 
region. For this task, loss was calculated using the mean squared error, which was found to be 
comparable to modeling counts using the zero-inflated negative binomial distribution. 
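
The continuous target and its loss can be written out in a few lines. This is only a sketch with hypothetical function names and toy values; it shows the log(1+counts) transform and the mean squared error, and does not cover the zero-inflated negative binomial alternative mentioned above.

```python
import math


def log1p_counts(counts):
    """Transform raw counts to log(1 + counts)."""
    return [math.log1p(c) for c in counts]


def mean_squared_error(predicted, target):
    """Mean squared difference between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


# Hypothetical raw counts for one gene at three spots.
toy_targets = log1p_counts([0, 9, 99])
toy_mse = mean_squared_error([0.0, 2.0, 5.0], toy_targets)
```

The transform maps a count of zero to zero and compresses large counts, which keeps highly expressed genes from dominating the squared-error loss.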


2.3.2. Modeling Approach 


Previous work has established the importance of spatial and neighborhood context information in 
both the dichotomized and continuous gene expression tasks.° In this study, we leveraged the SWIN- 
T vision transformer, a hierarchical transformer that has gained repute for building hierarchical 
feature maps by iteratively merging information from nearby image patches in deeper layers.’ 
Transformers divide images into smaller subimages, and numerical descriptors are extracted for 
each subimage using convolutional filters, along with information on the relative positioning of the 
subimages. Self-attention mechanisms are used to route information across the image based on the 
relevance of one subregion of the image to another. In both the dichotomized and continuous 
expression tasks, the output layer of the base SWIN-T model was modified. In particular, the output 
layer was expanded to consist of two feed-forward layers of sizes 768 and 2000, chosen through 
coarse experimentation to maximize model performance. Both the dichotomized and continuous 
expression models yielded predictions for the 1000 most spatially variable genes. 


2.3.3. Data Augmentation and Hyperparameter Selection 


To improve the robustness and generalizability of these models to varied histological contexts, all 
images in the training set were subject to a series of data augmentation transformations implemented 
using the Albumentations package.® Images were first resized to 448 by 448 pixels in size, the input 
dimensionality for the SWIN-T model. Horizontal flips and random brightness contrast were then 
performed with probabilities 0.5 and 0.2, respectively. A shift, scale, and rotate transformation was 
also applied to every image with a probability of 0.3. A shifting limit of 0.1, a scaling limit of 0.1, 
and a rotation limit of 30 were used here. Additionally, random rectangular areas of the images were 
erased— a maximum of 8 holes were produced per image, each hole obscuring at most 16 by 16 
pixels. 

Hyperparameters were obtained for both the dichotomized and continuous expression models 
via a coarse hyperparameter grid search. For the dichotomized models, optimal performance was 
observed using a batch size of 64, a learning rate of 0.5 × 10⁻⁶, and a training length of 20 epochs. 
For the continuous models, optimal performance was observed using a batch size of 64, 
a learning rate of 0.33 × 10⁻⁶, and a training length of 20 epochs. The Lion optimizer was used 
in both cases.? 


2.4. Cross Validation 


Model performance was measured via leave-one-patient-out cross-validation (LOOCV). In this 
procedure, three of the four Visium spatial transcriptomics samples were used for training and 
validation, while the remaining sample was used for testing. This procedure was repeated four times 
to account for all possible training/testing combinations. Reported performance metrics 
are the macro-averaged (across slides, weighting each slide equally) median (across genes) 
area under the receiver operating characteristic curve (AUROC) and F1 score (F1) statistics for the 
dichotomized task, and Spearman correlation coefficients comparing true versus 
predicted pseudocounts, log(1+counts), for the continuous task. Macro performance statistics 
underwent 1,000-sample nonparametric bootstrapping across the Visium spots to yield 95% 
confidence intervals. 
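
The evaluation scheme can be sketched as follows: enumerate the four train/test splits, take the median across genes within each held-out slide, average those medians with equal weight per slide, and obtain a percentile bootstrap interval. This is a simplified illustration, not the study's code. The function names, the toy scores, and the generic percentile bootstrap over a list of values are assumptions; the participant numbers are those given in Section 2.1.

```python
import random
import statistics


def leave_one_patient_out(patient_ids):
    """Yield (training_ids, held_out_id) for every patient in turn."""
    for held_out in patient_ids:
        yield [p for p in patient_ids if p != held_out], held_out


def macro_averaged_median(scores_by_slide):
    """Median across genes within each slide, then an unweighted mean across slides."""
    medians = [statistics.median(genes.values()) for genes in scores_by_slide.values()]
    return sum(medians) / len(medians)


def bootstrap_ci(values, stat=statistics.mean, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(values)."""
    rng = random.Random(seed)
    n = len(values)
    boots = sorted(stat([rng.choice(values) for _ in range(n)]) for _ in range(n_boot))
    lower = boots[int(n_boot * alpha / 2)]
    upper = boots[min(n_boot - 1, int(n_boot * (1 - alpha / 2)))]
    return lower, upper


splits = list(leave_one_patient_out([14, 107, 167, 178]))

# Hypothetical per-gene AUROC values on two held-out slides.
toy_scores = {
    "slide_a": {"gene_1": 0.6, "gene_2": 0.8, "gene_3": 1.0},
    "slide_b": {"gene_1": 0.4, "gene_2": 0.6, "gene_3": 0.5},
}
summary = macro_averaged_median(toy_scores)
```

Resampling at the level of Visium spots, as described in the text, would replace `values` with per-spot records and recompute the full metric inside `stat`.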


2.5. Biological Salience 


To assess the model’s ability to capture meaningful biological information from tissue histology, 
model predictions were also scrutinized for their ability to 1) recapitulate a range of biologically 
salient pathways, 2) maintain the shape and spatial signature of ground truth spatial gene expression 
data in lower dimensional space (i.e., preserve key relationships, such that spots cluster similarly) using 
the aligned-UMAP procedure, and 3) facilitate the inference of biologically salient features, such as 
histological markers. These tasks are detailed below. 


2.5.1. Pathway Analyses 


Given the nature of histomorphological data, it is unreasonable to expect that every gene can be 
predicted from tissue histology alone. Accordingly, we sought to determine the biological pathways 
associated with sets of differentially performing genes to understand what biological 
properties make a gene amenable to prediction. We utilized the GO Biological Process 2023 
database through the EnrichR package to perform a pathway analysis on predicted genes stratified 
by decile after ranking genes based on predictive performance.'°!! The top 3 pathways were selected 
for each decile, from the 90th to 100th percentile (i.e., top performing genes) to the 0th to 10th 
percentile (i.e., worst performing genes), based on their combined score (i.e., magnitude of 
representation and statistical significance). Detected pathways were also filtered by tissue specificity 
(i.e., whether they could reasonably be involved with the skin). 
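
The stratification step can be sketched in pure Python: genes are ranked by a per-gene performance score and split into ten equally sized bins, each of which would then be submitted to an enrichment tool. The function name and the synthetic scores below are assumptions for illustration; the enrichment query itself is not shown.

```python
def stratify_by_decile(gene_scores, n_bins=10):
    """Split genes into `n_bins` groups by ascending performance score.

    Returns a list of gene lists; index 0 holds the worst-performing genes
    and the last index holds the best-performing genes.
    """
    ranked = sorted(gene_scores, key=gene_scores.get)  # worst first
    n = len(ranked)
    bins = [[] for _ in range(n_bins)]
    for rank, gene in enumerate(ranked):
        bins[min(rank * n_bins // n, n_bins - 1)].append(gene)
    return bins


# Synthetic scores for 100 hypothetical genes (gene_0 performs worst).
toy_gene_scores = {f"gene_{i}": i / 100 for i in range(100)}
deciles = stratify_by_decile(toy_gene_scores)
bottom_decile, top_decile = deciles[0], deciles[-1]
```

With the 1000 spatially variable genes used in this study, each decile would contain 100 genes.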

We further sought to identify whether the gene signatures corresponding to different histological 
architectures were congruent between true and predicted expression. First, the top 100 most 
differentially expressed genes were found using the Wilcoxon rank-sum test in a one vs. rest fashion 
for each tissue architecture (e.g., follicles versus non-follicular structures) using both predicted and 
ground truth data. A pathway analysis using GO Biological Process 2023 database through the 
EnrichR package was then performed for the top true and predicted differentially expressed genes 
for each architecture. The top 10 pathways were selected by combined score for each histological 
category. Detected pathways were compared between ground truth and predicted gene expression 
for each sample under the hypothesis that similar pathways should be associated with the same 
architectures. 
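
One simple way to quantify agreement between the two pathway lists is the Jaccard overlap of their top-ranked entries. The text does not specify a particular agreement measure, so this choice, the function names, and the placeholder pathway names and scores below are all assumptions for illustration.

```python
def top_k(pathway_scores, k=10):
    """Names of the k pathways with the highest combined score."""
    ranked = sorted(pathway_scores.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Placeholder pathway names and scores for one histological architecture.
truth_scores = {"pathway_a": 3.0, "pathway_b": 2.0, "pathway_c": 1.0}
predicted_scores = {"pathway_a": 5.0, "pathway_d": 4.0, "pathway_b": 1.0}
agreement = jaccard(top_k(truth_scores, k=2), top_k(predicted_scores, k=2))
```

A value of 1 would indicate identical top pathway sets for ground truth and predicted expression, and 0 would indicate no shared pathways.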


2.5.2. Similar Clustering of Visium Spots and Consistent Topology via Aligned-UMAP 


Model predictions were further assessed for their ability to recapitulate the topology (i.e., relationships 
between spots) of ground truth Visium spatial gene expression data within a lower dimensional 
space. This was accomplished through the comparison of Uniform Manifold Approximation and 
Projection (UMAP) embeddings (i.e., numerical representations that could be plotted in a 2D 
scatterplot; closer points share similar expression/biological relevance) for the ground truth and 
predicted expression profiles (on held-out slides) extracted using the SWIN-T model. Ground truth 
and predicted gene expression profiles for each slide were co-projected to a lower dimensional space 
using the Aligned-UMAP procedure to preserve the relative orientation and alignment between 
spots to enable comparison between the approaches. Each Visium spot from the WSI was plotted as 
a point in a 2D scatterplot and colored according to its gene expression profile as dictated by the Leiden 
clustering algorithm. In other words, ground truth Visium spots sharing similar transcriptional 
information are grouped into the same Leiden cluster, while genetically dissimilar spots are grouped 
into different Leiden clusters. These ground truth cluster assignments were overlaid on the scatterplots 
for the predicted expression patterns. It is expected that the relative positioning between the clusters 
would be preserved in the 2D scatterplot for the predicted expression, which would measure the 
extent to which model predictions recapitulated patterns associated with distinct histological regions 
of each WSI. 

Aside from overlaying the original ground truth clusters, predicted expression profiles were also 
separately clustered through the Leiden algorithm, yielding a separate set of cluster assignments for 
the same Visium spots. These assignments were compared to the ground truth Visium spatial 
transcriptomics profiles’ clusters. Similar clustering assignments would provide further evidence 
for greater correspondence between transcriptional data and information derived from the histology. 
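
Agreement between two sets of cluster assignments for the same spots can be summarized with the adjusted Rand index (ARI), which is invariant to how clusters are numbered. The text does not state which agreement measure, if any, was used, so ARI is an assumption here, and the sketch below is a plain-Python illustration with hypothetical labels.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(labels_a)
    pair_counts = Counter(zip(labels_a, labels_b))
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_pairs - expected) / (max_index - expected)


# Hypothetical cluster labels for four spots.
truth_clusters = [0, 0, 1, 1]
predicted_clusters = [1, 1, 0, 0]   # same partition, different cluster names
ari_same = adjusted_rand_index(truth_clusters, predicted_clusters)
ari_crossed = adjusted_rand_index(truth_clusters, [0, 1, 0, 1])
```

An ARI of 1 indicates identical partitions, values near 0 indicate chance-level agreement, and negative values indicate less agreement than expected by chance.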


3. Results* 


3.1. Prediction of Spatial Transcriptomic Patterns from Histology 


In the dichotomized prediction task, the SWIN-T vision transformer model achieved a 
macro-averaged (i.e., across slides) median AUC and F1 score of 0.80 and 0.61, respectively, across the 
testing sets (Supplementary Table 1). The model performed best on genes ADIPOQ, PLIN1, and 
PKP3 (involved in fatty acid metabolism", triacylglycerol storage!3, and desmosome function and 
stability!*, respectively), and worst on genes ANKRD35, ALAS, and MIA (of which the latter two 
are known to be involved in heme biosynthesis! and melanocyte migration'®, respectively). 
Dichotomized model predictions for genes ADIPOQ, PLIN1, and PKP3 are visualized across 
sample #14 and #178 (Figure 2), demonstrating spatial concordance between true and predicted 
expression. In the continuous prediction task, models achieved a macro-averaged median Spearman 
coefficient of 0.60 across the testing sets (Supplementary Table 1). The model performed best on 
genes KRT14, CXCL14, and COL1A2 (involved in epithelial cell integrity'’, keratinocyte function’, 
and collagen synthesis!’, respectively) and worst on genes CKM, MYLPF, and ODF2L (the former 
two are known to be involved in energy homeostasis? and muscle development?!). Continuous 
model predictions for genes KRT14, CXCL14, PI16 were visualized across samples 107 and 167 
(Supplementary Figure 2), demonstrating spatial concordance between true and predicted 
expression. Note that models in both the dichotomized and continuous prediction tasks were trained 
to predict the same set of 1000 spatially variable genes. 


[Figure 2 panel columns: WSI, Gene, Ground Truth, Predicted]


Figure 2: Dichotomized RNA Expression Prediction. Dichotomized spatial gene expression was inferred 
for (A) samples #14 and (B) #167 and compared with the respective ground truths. A spot is colored yellow 
if gene expression in this spot exceeds global mean gene expression. Performance is displayed for the top 
performing genes, ADIPOQ, PLIN1, and PKP3, which achieved macro-averaged AUC values of 0.942, 
0.938, and 0.918, respectively. 


* Supplementary materials can be found at the following DOI: https://doi.org/10.5281/zenodo.8197850 


3.2. Pathway Analysis 


For each performance decile, the top 3 most salient biological pathways by combined score were 
determined (Supplementary Table 2). Across both the dichotomized and continuous prediction 
tasks, biological pathways associated with the top performance decile (i.e., 90th to 100th percentile 
genes ranked by performance) pertained to skin and epidermis development and maintenance, skin 
cell proliferation, and the regulation of extracellular matrix and cell-cell adhesion (Table 1). By 
contrast, biological pathways associated with genes in the worst performance decile (i.e., 0th to 10th 
percentile genes ranked by performance) across both the dichotomized and continuous prediction 
tasks were far less associated with relevant biological phenomena, pertaining to immune signaling, 
cell-turnover regulation, gas transport, and muscle cell development (Table 1). More generally, 
biological pathways associated with higher-performing genes tended to be more closely related to 
skin development, differentiation, pigmentation, and fat metabolism, while distinct trends were less 
clear for those biological pathways associated with lower-performing genes (Supplementary Table 
2). 


Table 1: Performance Pathway Analysis. Combined performance statistics for both the dichotomized and 
continuous models were used to perform a performance-stratified pathway analysis. AUC and Spearman 
coefficient were used to stratify genes in the dichotomized and continuous tasks, respectively. The top 3 pathways, 
measured via EnrichR using the GO Biological Process 2023 database, are reported for the highest and lowest 
performance deciles. Refer to Supplementary Table 2 for an extended version of this table detailing all 
performance deciles. 


Gene Performance                                      | Task         | Pathway                                            | Score | Overlap | P-value
Top performing genes: 90th-100th Percentile Genes     | Dichotomized | Establishment of Skin Barrier                      | 2127  | 6/19    | 6.9E-10
                                                      |              | Skin Epidermis Development                         | 1785  | 6/21    | 4.0E-04
                                                      |              | Keratinocyte Proliferation                         | 803   | 2/6     | 1.4E-11
                                                      | Continuous   | Positive Regulation of Epidermis Development       | 2725  | 4/9     | 7.3E-08
                                                      |              | Desmosome Organization                             | 2654  | 3/6     | 2.4E-06
                                                      |              | Intermediate Filament Bundle Assembly              | 1905  | 4/7     | 4.2E-06
Bottom performing genes: 0-10th Percentile Genes      | Dichotomized | Interleukin-2-Mediated Signaling Pathway           | 615   | 2/7     | 2.8E-04
                                                      |              | Cellular Response to Interleukin-2                 | 615   | 2/7     | 5.1E-04
                                                      |              | Negative Regulation of T Cell Apoptotic Process    | 615   | 2/7     | 5.1E-04
                                                      | Continuous   | Gas Transport                                      | 504   | 3/15    | 5.3E-05
                                                      |              | Positive Regulation of Respiratory Burst           | 493   | 2/8     | 6.7E-04
                                                      |              | Regulation Of Skeletal Muscle Cell Differentiation | 493   | 2/8     | 6.7E-04


The top 100 differentially expressed genes for each histological sub-type were determined for 
both ground truth and predicted data samples, and these genes were leveraged for further pathway 
analyses. The pathway analysis results for the top 100 differentially expressed genes for 
dichotomized and continuous gene expression data are reported in Supplementary Table 3 and 
Supplementary Table 4. Across both dichotomized and continuous gene expression data, pathways 
associated with the formation of the sebaceous gland and epidermis were found to be in high
agreement between ground truth and predicted expression, while the agreement was more modest
in other histological features (Table 1; Supplementary Table 3).


3.3. Topological Consistency 


3.3.1. Leiden Clustering 


A visual inspection of the aligned-UMAP diagrams demonstrates similar clustering patterns and 
topological consistency between the predicted and the ground truth expression data across both 
models trained for dichotomized and continuous regression tasks (Figure 3; Supplementary 
Figure 2). We noted that the Leiden clusters assigned to the ground truth expression were similar 
to those assigned to predicted expression embeddings. Nonetheless, differences remained. We did
not observe complete separation in the predicted expression embeddings, which instead showed a
fuzzier or more connected/intermediate topological structure (Figure 3; Supplementary Figure 2).
Spots in the predicted data were, accordingly, located between Leiden clusters more often than spots
in the ground truth genetic data, where Leiden clusters tended to be far more spatially distinct. This
feature was noted for both dichotomized and continuous expression models, although this pattern 
was more prevalent for dichotomized expression (Figure 3; Supplementary Figure 2). 

Model predictions in both the dichotomized and continuous expression tasks also preserved the
general shape of ground truth genetic data when plotted across each whole slide image
(Supplementary Figures 3 and 4). We further observed, however, that predicted data tended to
contain genetically intermediate states, as evidenced by the greater number of Leiden clusters
produced in the predicted data compared to the ground truth data at the same Leiden
clustering resolution (Supplementary Figures 3 and 4). Though models in both tasks produced
data that captured larger macro-architectural differences in gene expression found across skin tissue,
the dichotomized model tended to produce data that more closely preserved the spatial arrangement
of Leiden clusters found in the ground truth data when plotted across the slide
(Supplementary Figure 3). Models in the continuous expression task, though high performing,
tended to produce data that recapitulated the spatial genetic variation of macro-architectural features
in skin tissue less well, as evidenced, again, by disparities in the number and placement of Leiden
clusters when comparing the predicted and ground truth data (Supplementary Figure 4).




Figure 3: Dichotomized Expression Topological Analysis. (A) From left to right, ground truth spatial 
Leiden clustering, ground truth aligned UMAP, and predicted aligned UMAP for sample #14. (B) From left 
to right, ground truth spatial Leiden clustering, ground truth aligned UMAP, and predicted aligned UMAP 
for sample #178. In both rows, the same spots are colored identically according to their ground truth gene 
expression profiles after Leiden clustering analysis. Dichotomized gene expression data was used here. The 
Leiden clustering resolution was set to 0.2.


3.3.2. Histological Annotations


Performing Aligned-UMAP on the ground truth and predicted expression data for Visium spots 
tagged by histological structures demonstrated that embeddings in both groups clustered by distinct 
histological regions of skin tissue (Figure 4; Supplementary Figure 5). That is, Visium spots 
corresponding to similar histological structures clustered in similar locations across both UMAP 
plots, preserving the genetic relationships between these histological architectures. The distinctness 
of these clusters was preserved for both dichotomized and continuous gene expression predictions, 
though predicted continuous expression data appeared to preserve the topology better than 
dichotomized gene expression data (Figure 4; Supplementary Figure 5). 


[Figure 4 panels: Ground Truth, Predicted. Legend: Eccrine gland, Epidermis, Hair follicle, Lymphatic/vascular, Sebaceous gland.]


Figure 4: Dichotomized Expression Histological Analysis. Aligned-UMAP procedure was used to reduce 
the dimensionality of both the ground truth and predicted dichotomized gene expression vectors for (A) 
sample #14 and (B) sample #178. Spots are colored according to their histological annotations. 


4. Discussion 


In this work, we developed a set of spatial gene expression inference models for histopathologically 
normal skin tissue in the context of molecular changes associated with photoaging. We make use of 
the novel CytAssist co-registration/imaging technology, allowing for 40X resolution imaging of 
tissue slides. Beyond quantitative validation of performance (e.g., AUC, F1, Spearman coefficient), 
we also reaffirmed the biological relevance of the predicted expression pattern. In particular, 
extracted histological features from our models remained faithful to underlying biological pathways, 
buttressing their potential use across a range of biological inference tasks, and lending credibility to 
their role in democratizing the spatial transcriptomics paradigm to the broader research community. 

With the Visium CytAssist technology, our models were trained with histological information 
at >4 times the spatial resolution of previous studies. We achieved comparable performance to a 
prior study that utilized an Inception convolutional neural network for both dichotomized and 
continuous prediction of gene expression. Caution is needed when comparing these models, as they
represent different biological domains (skin vs. colon; different genes used in the models), and
observed performance disparities may arise from variations in methodology. These differences could
include the utilization of distinct imaging resolutions (40X vs. 20X) or the selection of different
modeling approaches (ZINB vs. raw log gene expression). Since our predictions were made within a
more focused visual receptive field, disregarding the surrounding wider tissue architecture, future
work can explore larger-scale histological context.

The pathway and topological analyses provided useful insights into the nature of spatial RNA
inference from histology. Our findings suggest that genes with a clear histological basis are more
likely to be accurately predicted than genes lacking such a theoretical histological foundation.
Our results also demonstrate the importance of developing
models germane to the biological question at hand: a model trained on colon tissue is not expected 
to perform well on skin tissue. Hence, investigating modeling approaches that prioritize specific 
biological phenomena emerges as a promising direction for future research. Genes that demonstrate 
good performance, or effectively recapitulate histological patterns, could potentially be utilized for 
further research applications across larger cohorts. 

The topological analysis demonstrated that predicted expression profiles did not cluster as 
distinctly as the original expression patterns. When coloring by Leiden cluster affiliation and 
histological association, the predicted spot level gene expression fell between ground truth clusters, 
representing intermediate histological states learned by the neural networks. Future work may seek 
to understand which histomorphological features relate to different molecular pathways through 
newly established interpretation approaches. The integration of coregistered slide imaging with
spatial molecular information can facilitate such analyses. Moreover, by subsetting predicted and 
true gene expression based on shared molecular pathways (such as genes involved in epithelium 
development, cell-cell junctions, immune function, etc.) and conducting comparable topological 
analyses, it is possible to identify the molecular pathways that exhibit the highest degree of 
topological distinctiveness. Nevertheless, topological analyses have emerged as a timely and 
relevant topic in the realm of single-cell and spatial analyses, offering the potential to uncover 
additional dimensions of cellular and histological heterogeneity.

This study reinforces the potential of spatial transcriptomics approaches for research and clinical 
applications. For example, photoaging, which is linked to skin cancer risk, lacks reliable 
measurement tools due to variations in histological assessments and self-reported UV exposure. 
Existing analyses typically focus on specific cellular components (e.g., dermal fibroblasts, elastosis, 
keratoses), often disregarding or unaware of photoaging-related factors. Expanding spatial 
molecular findings through RNA inference to a larger cohort can help identify cell-type specific 
sources of photoaging in specific tissue architectures while controlling for numerous potential 
confounders and presents an intriguing area of follow-up given the models established in this study.
Spatial RNA inference can uncover novel cellular components related to precancerous alterations
resulting from chronic sun exposure. Through targeted profiling of these tissue regions, researchers can
explore residual heterogeneity while examining cell-type specific alterations and additional factors
related to accelerated aging, with the caveat that tissue from this cohort is histologically adjacent to
the surgical site of repair, potentially harboring a cancerization field effect.

In clinical practice, the use of virtual RNA models has the potential to inform treatment planning 
and assess treatment response. If these models can identify proxy measures of photoaging, spatial 
molecular inference can be employed to evaluate the effectiveness of skin therapeutics through 
quantitative assessment of biomolecular changes at screening, baseline, and endpoint. This approach 
offers a more objective and quantitative measurement of the impact of treatments on skin health and 
can enhance the validity of therapeutic interventions. Similarly, applications are envisioned for
the treatment of non-healing skin ulcers and, separately, hair loss driven by an autoimmune response
(e.g., alopecia areata), revealing potential components of relevant immune polarity (e.g., M1/M2
macrophage balance). Additionally, virtual RNA inference models can inform disease
management options for various solid tumors, functioning similarly to immunohistochemical assays
(e.g., immunoscore) that shed light on the infiltration of cytotoxic immune cell lineages, identifying 
independent risk factors of tumor recurrence and survival. Spatial molecular assessments can 
identify targetable therapeutic pathways for personalized treatment options. 

This study is not without limitations that can direct future research. Our sample size was
small, limiting our ability to account for potential variability in histology and surgical sites. 
Additionally, the non-biopsied nature of the samples and their proximity to potentially precancerous 
tumor tissue may introduce differences in gene expression related to factors other than UV exposure. 
Introducing matching normal control tissue, considering factors like limited sun exposure and low 
field effect potential, along with expanding the cohort to control for additional age ranges, sex, and 
tissue site, could help reveal photoaging differences specific to these groups. Skin tone is another 
confounding factor that should be addressed, and it can be controlled using measures such as the 
Fitzpatrick skin phototype scale or derived continuous measures. To make photoaging assessments
more rigorous and quantitative, various measures of photoaging (such as DNA methylation,
age-related measures, elastosis, and UV questionnaires) can be combined using factor analyses,
yielding meaningful composite measures. Additionally, one general limitation of our topological
analyses was their qualitative, rather than quantitative, nature. Shifts in distribution between
ground truth and predicted Visium spot topology could also be captured using more nuanced 
mathematical notions such as the KL-divergence, Wasserstein distance, maximum mean 
discrepancy, and silhouette score. Addressing these limitations and incorporating a more diverse 
and extensive sample size can enhance the reliability and applicability of future studies in this field. 
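As an illustration of how such distribution shifts might be quantified, the following minimal sketch computes two of the measures named above on low-dimensional spot embeddings. It is not part of the analyses reported here: the embeddings are synthetic, the kernel bandwidth `gamma` is an arbitrary assumption, and the Wasserstein helper is a simplified one-dimensional version that requires equal sample sizes.

```python
import numpy as np

def rbf_mmd2(x, y, gamma=1.0):
    """Biased estimate of the squared maximum mean discrepancy (RBF kernel)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    def kernel(a, b):
        # Pairwise squared Euclidean distances, then a Gaussian kernel.
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-gamma * d2)

    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()

def wasserstein_1d(u, v):
    """1-D Wasserstein-1 distance between two equal-sized samples."""
    u = np.sort(np.asarray(u, dtype=float))
    v = np.sort(np.asarray(v, dtype=float))
    if u.shape != v.shape:
        raise ValueError("this simplified version needs equal sample sizes")
    return float(np.abs(u - v).mean())

# Synthetic 2-D embeddings standing in for ground truth and predicted spots.
rng = np.random.default_rng(0)
truth = rng.normal(size=(200, 2))
pred_close = truth + rng.normal(scale=0.05, size=truth.shape)
pred_shifted = truth + 1.0

mmd_close = rbf_mmd2(truth, pred_close)
mmd_shifted = rbf_mmd2(truth, pred_shifted)
w_shifted = wasserstein_1d(truth[:, 0], pred_shifted[:, 0])
```

Larger values indicate a greater shift between the predicted and ground truth topologies; in practice such statistics would be compared against a permutation baseline.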


5. Conclusion 


Machine learning technologies that can infer spatial molecular information from routine tissue stains 
have the potential to facilitate low-cost, accessible spatial transcriptomic assessments for large-scale
molecular epidemiological studies. Such studies can uncover novel risk factors of early 
photocarcinogenesis or inform relevant treatment/therapeutic options by expanding the set of 
targetable molecular pathways within specific tissue architectures. Our skin study sets the stage for 
larger-scale studies to identify spatial molecular correlates of skin sun damage and evaluate novel 
therapeutics that may reverse this damage. While our models exhibited impressive performance in 
predicting dichotomized and continuous gene expression within tissue slides, it is crucial to 
acknowledge the need for further development and validation of this approach. When utilizing these 
algorithms, it is important to consider the genes that are known to be influenced by histological 
characteristics. Additionally, any novel findings obtained through these tools should be corroborated 
and validated using well-established immunostaining techniques, ensuring the reliability and 
robustness of the results.


PEPSI: Polarity measurements from spatial proteomics imaging suggest immune cell


Subcellular protein localization is important for understanding functional states of cells, but
measuring and quantifying this information can be difficult and typically requires high-resolution 
microscopy. In this work, we develop a metric to define surface protein polarity from 
immunofluorescence (IF) imaging data and use it to identify distinct immune cell states within 
tumor microenvironments. We apply this metric to characterize over two million cells across 600 
patient samples and find that cells identified as having polar expression exhibit characteristics 
relating to tumor-immune cell engagement. Additionally, we show that incorporating these 
polarity-defined cell subtypes improves the performance of deep learning models trained to predict 
patient survival outcomes. This method provides a first look at using subcellular protein expression 
patterns to phenotype immune cell functional states with applications to precision medicine. 


Keywords: subcellular localization, proteomics, multi-plex immunofluorescence 


1. Introduction 


Spatial proteomics methods such as immunofluorescence (IF) and immunohistochemistry (IHC) 
enable an unprecedented view of tumor microenvironments by preserving the spatial structure of 
tissues at subcellular resolution. However, standard analyses aggregate and average protein
expression within segmented single cells, discarding sub-cellular and morphological signals. This
approach introduces a number of analytical limitations. First, segmentation can be imprecise.
Second, averaging discards subcellular protein expression patterns that could allow the inference
of cellular functional states (i.e., polarized vs. uniform). Thus, while cells can be phenotyped in
the context of their spatial neighbors, cells that exhibit differential protein localization are not
differentiated.

The relationship between protein localization and function is well-established in many 
contexts. For instance, during T cell engagement with presented antigens (e.g., on tumor cells), 
the CD4 and CD8 coreceptors are recruited to the immune synapse, while they present uniformly 
on the surface of a cell in a naive or exhausted state. The immune synapse, also known as the
supramolecular activation cluster, is a specialized junction formed by many proteins during an
immune response. In the context of a tumor immune response, active and engaged T cells could
correlate with better patient survival, whereas the presence of exhausted or inactive T cells may be
indicative of worse outcomes. However, it is unknown to what extent this or other such
dynamic subcellular localization events are discernible from whole slide scale histology images.


© 2023 The Authors. Open Access chapter published by World Scientific Publishing Company


Common analyses of cell morphology utilize computer vision models for the automatic
extraction of image features from tissue patches. However, interrogating such models for specific
cell-cell interactions is difficult. Previous work toward characterizing surface protein localization
includes statistical methods for identifying ligand-receptor pairs in transcriptomics, polarity
localization measurements in mRNA, and co-localization with protein expression.


In this paper, we present a novel approach, PEPSI (Protein Expression Polarity Subtyping 
in Immunostains), for measuring subcellular protein localization toward characterizing the tumor 
microenvironment. We describe a simple, explainable method for computing the polarity of cell 
surface biomarkers. We apply this metric on multiple large-scale CODEX (co-detection by 
indexing) datasets spanning over two million cells, three clinical sites, and 600 patient samples. 
We focus on several key immune cells that are well-characterized and known to express polarized 
surface protein markers during activation/engagement. We define additional cell subtypes relating 
to morphology (polarized, uniform) for representative biomarkers (CD8, CD4, CD20) of immune 
cells (T cells and B cells). We find that surface protein marker polarity is significantly correlated 
with positive patient outcomes, even after controlling for various technical artifacts, suggesting 
that this may be important for characterizing the functional state of immune cells. We believe that 
inferring functional subtypes of cells can offer a better understanding of patient response to drug
treatments and disease prognostic indicators.


2. Results


2.1. Polarity measurement


We describe a straightforward method for extracting polarity measurements for a given cell based 
on a polar transformation of the IF signal with respect to the cell centroid (Figure 1A, Methods). 
Plotting the distribution of scores for four markers in their cognate expression cell types - CD8 in 
CD8 T cells, CD4 in CD4 T cells, CD20 in B cells, and PanCK in tumor cells - shows that the 
scores exhibit continuous distributions (Figure 1B). The first three biomarkers, which are known 
to polarize in cells undergoing immune activation, show significantly higher average polarity 
scores versus PanCK, which is not known to polarize as such. To obtain a discrete polarity 
classification, we threshold the raw scores based on an empirical heuristic (Methods), obtaining 
proportions for polar, uniform, and ‘other’ cells for each of the three immune cell types (Figure 
1C). For instance, polar cells account for 3.7%, 3.0%, and 2.6% of CD8 T cells, CD4 T cells, and
B cells, respectively. Example cells were randomly inspected in their cell contexts to visually
validate the classifications (Figure 1D). 


Fig. 1: Overview of the PEPSI polarity measurement framework. Panel A: Schematic of polarity
measurement algorithm. For a given mIF sample, patches (40px by 40px) are extracted around each cell. For 
each cell, a polar transform is computed on the patch, followed by summing along the y-axis and then 
computing the area under the CDF curve, yielding a polarity score (from 0 to 1). Panel B: The polarity score 
histograms are shown for CD8, CD4, CD20, and PanCK biomarkers in CD8 T cells, CD4 T cells, B cells, 
and tumors, respectively. The orange (left) line indicates the upper threshold chosen for identifying uniform 
cells, whereas the green (right) line indicates the lower threshold chosen for identifying polar cells. Cells in 
between two thresholds are indicated as ‘Other’. Panel C: The percent proportions of polar, uniform, and 
other cells across the three cell types and their key biomarkers. Panel D: For CD8 T cells, representative 
examples of polar and uniform CD8 T cells are shown, and color-coded by relevant biomarkers. 
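The steps in Panel A can be sketched as follows. This is an illustrative approximation, not the reference implementation: the angular binning (36 bins), the restriction to the inscribed circle, the descending sort before the cumulative sum, the rescaling of the area to the range 0 to 1, and the two thresholds in `classify_polarity` are all assumptions made here for the example. The sketch also assumes non-negative intensities.

```python
import numpy as np

def polarity_score(patch, n_angle_bins=36):
    """Approximate polarity of one marker channel in a cell-centered patch."""
    patch = np.asarray(patch, dtype=float)
    h, w = patch.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0          # patch center = cell centroid
    yy, xx = np.mgrid[0:h, 0:w]
    dy, dx = yy - cy, xx - cx
    radius = np.hypot(dy, dx)
    angle = np.arctan2(dy, dx)                      # in [-pi, pi]

    # Polar transform followed by summing over radius: total signal per angle bin.
    inside = radius <= min(h, w) / 2.0
    bins = ((angle + np.pi) / (2 * np.pi) * n_angle_bins).astype(int) % n_angle_bins
    profile = np.bincount(bins[inside], weights=patch[inside], minlength=n_angle_bins)

    total = profile.sum()
    if total <= 0:
        return 0.0
    # Area under the cumulative curve of the sorted angular profile.
    cdf = np.cumsum(np.sort(profile)[::-1]) / total
    area = cdf.mean()
    uniform_area = (n_angle_bins + 1) / (2.0 * n_angle_bins)  # area for a flat profile
    return float((area - uniform_area) / (1.0 - uniform_area))

def classify_polarity(score, uniform_max=0.2, polar_min=0.6):
    """Discretize a score; both thresholds are hypothetical placeholders."""
    if score >= polar_min:
        return "polar"
    if score <= uniform_max:
        return "uniform"
    return "other"

# Synthetic 40px by 40px patches: evenly spread signal vs. a one-sided blob.
uniform_patch = np.ones((40, 40))
polar_patch = np.zeros((40, 40))
polar_patch[15:25, 32:38] = 1.0

uniform_score = polarity_score(uniform_patch)
polar_score = polarity_score(polar_patch)
```

Under these assumptions, a signal spread evenly around the centroid scores near 0 and a signal concentrated on one side scores near 1.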


2.2. Polarized cell neighborhoods are more enriched with tumors 


Next, we explore whether polar immune cells might exhibit differences in their cellular 
neighborhoods with respect to uniform cells of the same type. Examples of a polarized CD8 T cell 
adjacent to a tumor cell (left) or not (right) are shown in Figure 2A. We found that, for CD8 T 
cells, CD4 T cells, and B cells, tumor cells were consistently enriched in the immediate cell
neighborhoods of polar cells versus uniform cells (Figure 2B). Conversely, we also find that cells
with tumor cell neighbors are more likely to be polar (Supp. Table 1). Given that polar expression
can indicate antigen engagement during contact with tumor cells, this provides evidence that
polarity is a biologically significant biomarker. 
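A minimal sketch of this kind of enrichment comparison is given below, using a log2 fold change with a bootstrapped 95% confidence interval as in Figure 2B. The data are synthetic, and summarizing each cell by a binary flag ("has at least one tumor neighbor") is a simplifying assumption made for illustration; it is not necessarily how neighborhood composition was measured in this work.

```python
import numpy as np

def log2_fold_change(polar_flags, uniform_flags, pseudo=1e-9):
    """log2 ratio of the neighbor-type frequency near polar vs. uniform cells."""
    return float(np.log2((np.mean(polar_flags) + pseudo) /
                         (np.mean(uniform_flags) + pseudo)))

def bootstrap_ci(polar_flags, uniform_flags, n_boot=1000, seed=0):
    """Percentile bootstrap 95% confidence interval for the log2 fold change."""
    rng = np.random.default_rng(seed)
    polar_flags = np.asarray(polar_flags, dtype=float)
    uniform_flags = np.asarray(uniform_flags, dtype=float)
    stats = np.empty(n_boot)
    for i in range(n_boot):
        # Resample cells with replacement within each group.
        p = rng.choice(polar_flags, size=polar_flags.size, replace=True)
        u = rng.choice(uniform_flags, size=uniform_flags.size, replace=True)
        stats[i] = log2_fold_change(p, u)
    return np.percentile(stats, [2.5, 97.5])

# Synthetic flags: 1 if a cell has a tumor neighbor, 0 otherwise.
rng = np.random.default_rng(1)
polar_has_tumor = rng.random(500) < 0.6
uniform_has_tumor = rng.random(500) < 0.3

lfc = log2_fold_change(polar_has_tumor, uniform_has_tumor)
ci_low, ci_high = bootstrap_ci(polar_has_tumor, uniform_has_tumor)
```

A neighboring cell type would be reported as significantly enriched when the interval excludes zero.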


[Figure 2. Panel A: Examples of CD8 T cells and their neighbors (diagram and CODEX image of a polar CD8 T cell engaged with tumor, and of a non-engaged uniform CD8 T cell; channels: DAPI, CD8, CD4, CD45, CD20, PanCK). Panel B: Cell types enriched near polar vs uniform immune cells, shown for CD8 T cells, CD4 T cells, and B cells; axis: log2(fold change).]


Fig 2: Tumor cells are more present next to polar cells versus uniform cells. Panel A: Diagrams and mIF 
images illustrating two possible states of CD8 T cells. Left: A CD8 T cell engaged with a tumor cell, with 
polar expression of CD8 at the immunological synapse. Right: A uniformly expressed CD8 T cell, with no 
tumor engagement. Panel B: Since polarity may be indicative of tumor engagement, we measure the cell 
type composition of neighborhoods around polar versus uniform cell types. We find that tumor cells are 
consistently and significantly more enriched in polar cell neighborhoods versus uniform cell neighborhoods 
for all three immune cell types. We compute bootstrapped 95% confidence intervals for each neighboring 
cell type and only show cell types with significant log fold changes.


2.3. Metric control experiments


In addition to visual validation, we perform tests to validate that the metric distribution is not 
explained by simple technical covariates or noise. We find that polarity cannot be simply explained 
by significantly more crowded cell neighborhoods (Supp. Figure 1A) or differences in cell size (R²
of 0.08, Supp. Figure 1B). During antigen engagement, multiple biomarkers are known to jointly
express at the site of the immunological synapse. Supp. Figure 1C measures the correlation of
polarity scores between all pairs of biomarkers as expressed in all T cells, B cells, and tumor cells.
CD3e, a biomarker known to express during engagement, is jointly polarized with CD4 and CD8,
while PanCK, a biomarker not known to be active during antigen engagement, does not correlate
with CD20, CD3e, CD4, or CD8. We note that the observed co-polarity between CD4 and CD8 is 
likely due to expression from neighboring cells that are being captured by our algorithm as 
originating from the same cell, an artifact that occurs in a small fraction of T cells (Supp. Figure 
1D). 
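The two control checks above (co-polarity between biomarkers and the R² between polarity and cell size) reduce to simple correlation computations, sketched here on synthetic scores. The shared "engagement" signal, the noise levels, and the cell size distribution are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 1000
shared = rng.normal(size=n_cells)  # hypothetical shared engagement signal

# Synthetic per-cell polarity scores for three biomarkers.
scores = {
    "CD3e": shared + 0.5 * rng.normal(size=n_cells),
    "CD8": shared + 0.5 * rng.normal(size=n_cells),
    "PanCK": rng.normal(size=n_cells),  # independent of the shared signal
}
names = list(scores)

# Pairwise Pearson correlation of polarity scores (biomarkers x biomarkers).
corr = np.corrcoef(np.vstack([scores[name] for name in names]))

# R^2 between a technical covariate (cell size) and one polarity score.
cell_size = rng.normal(loc=100, scale=10, size=n_cells)
r2_size = float(np.corrcoef(cell_size, scores["CD8"])[0, 1] ** 2)
```

In this toy setup the CD3e and CD8 scores correlate strongly, PanCK does not, and cell size explains almost none of the variance in polarity.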


2.4. Polarity-defined cell types improve model prediction of survival outcomes 


To demonstrate that the newly classified polar or uniform cell subtypes have biological or clinical 
relevance, we utilize deep learning models to predict patient survival from cell phenotypes. We 
train two models: a 3-layer multi-layer perceptron (MLP) neural network, which takes as input the 
percent composition of cell types per sample and predicts a binary outcome (five-year survival); 
and a graph-based neural network (GNN) that takes as input a 3-hop neighborhood of cells centered 
around a single cell, and predicts the survival status of the sample from which the neighborhood of 
cells originated. Both models show modest but consistent improvement in performance across three 
distinct studies and two disease types after including the 6 new cell types (Table 1). Supp. Table 2 
shows ablations where the MLP model is trained on each polar/uniform cell type individually. Of 
note, a model is trained with Ki67 polarity in tumor cells as a negative control (since Ki67 is not 
known to express polarly) and demonstrates no improvement over the baseline. Finally, we use the
percent of polar cells per sample as the sole predictor of survival outcomes and compute its AUROC
(Supp. Table 3); even this simple metric alone has predictive accuracy
above chance.


Survival status                          | UPMC-HNC      | Stanford-CRC  | DFCI-HNC generalization (using UPMC-HNC model)
MLP                                      | 0.759 (0.053) | 0.538 (0.132) | 0.655 (0.074)
 - Adding polarity-defined cell subtypes | 0.803 (0.049) | 0.559 (0.135) | 0.672 (0.072)
GNN                                      | 0.839 (0.048) | 0.684 (0.092) | 0.853 (0.046)
 - Adding polarity-defined cell subtypes | 0.856 (0.045) | 0.743 (0.090) | 0.880 (0.051)


Table 1: Adding polarity-specific cell types improves patient survival prediction in machine learning
models. To validate the usefulness of the polarity-specific cell types derived from our polarity measurement
method, we train two models to predict patient survival status, with and without the additional cell types
(polar and uniform CD8 T cells, CD4 T cells, B cells). Each cell represents the AUC of the model’s
prediction of patient survival. We observe that adding the additional cell types improves model performance
across three datasets, three clinical sites, and two disease types. Standard deviations of bootstrapped samples
are reported in parentheses. Predictions are generated at the sample level.
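For readers who wish to reproduce this style of reporting, the sketch below computes a sample-level AUC together with the standard deviation across bootstrap resamples. The rank-based AUC formula is standard; the resampling scheme (500 resamples of samples with replacement) and the synthetic labels and scores are assumptions, since the exact bootstrap procedure is not specified here.

```python
import numpy as np

def auroc(labels, scores):
    """Rank-based AUROC (Mann-Whitney U); tied scores receive average ranks."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores, kind="stable")
    ranks = np.empty(scores.size)
    ranks[order] = np.arange(1, scores.size + 1)
    for value in np.unique(scores):
        tied = scores == value
        ranks[tied] = ranks[tied].mean()
    n_pos, n_neg = labels.sum(), (~labels).sum()
    return float((ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))

def bootstrap_auroc_sd(labels, scores, n_boot=500, seed=0):
    """Standard deviation of the AUROC across bootstrap resamples of samples."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    values = []
    for _ in range(n_boot):
        idx = rng.integers(0, labels.size, size=labels.size)
        if labels[idx].min() == labels[idx].max():
            continue  # skip resamples that contain a single class
        values.append(auroc(labels[idx], scores[idx]))
    return float(np.std(values))

# Synthetic example: survival labels and a noisy per-sample score.
rng = np.random.default_rng(0)
survived = rng.integers(0, 2, size=100)
percent_polar = survived + rng.normal(scale=1.0, size=100)

auc_value = auroc(survived, percent_polar)
auc_sd = bootstrap_auroc_sd(survived, percent_polar)
```

The same two functions apply whether the score is a model prediction or a single summary feature such as the percent of polar cells per sample.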


2.5. Presence of polar cells improves patient survival with in silico models under label and 
spatial permutations 


In the permutation experiments shown in Figure 3 and Supp. Table 4, the GNN model predicts
significantly worse survival in tumor microenvironments where the subtype of the immune cells
is flipped from polar to uniform (Figure 3B). The inverse is also true; the predicted survival
improves when cells are flipped from uniform to polar. Even when fixing the cell type composition,
dispersing the location of the immune cells away from the tumor cells results in a decrease in
predicted survival (and vice versa) (Figure 3C). These results suggest that polar immune cells are
important not simply for their presence in a sample, but for their proximity to tumor cells.


Fig 3: In silico experimentation reveals that polar cells are correlated with positive patient outcomes. Panel
A: A schematic of the SPACE-GM graph neural network described in Wu et al. 2022. An mIF sample is
represented by a Voronoi diagram, which is projected into a spatial graph. A graph neural network is trained to predict
survival outcomes based on 3-hop cellular neighborhoods. Panel B: Using a trained GNN, we perform label 
permutation on each sample graph, where the subtype of each immune cell is flipped to either polar or 
uniform, and the averaged model prediction is measured. Even when fixing spatial neighborhoods, we 
observe an increased predicted survival probability when cells are polarized, and a slight decrease when 
cells are turned into uniform states. Panel C: Now, fixing the cell types, polar and uniform cell 
neighborhoods are sampled and spatially permuted. We observe, on average, a larger increase in predicted 
survival when polar cells are dispersed from the clustered state than with uniform cells. 
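The label permutation in Panel B can be sketched generically as below. Everything here is a stand-in: `toy_model` is a hypothetical function of subtype composition used only so the example runs, not the trained GNN, and the integer subtype codes are arbitrary. The sketch shows the mechanics of the perturbation, not the reported effect sizes.

```python
import numpy as np

POLAR, UNIFORM, OTHER = 0, 1, 2  # hypothetical integer codes for subtype labels

def toy_model(subtypes):
    """Stand-in for a trained model: prediction rises with the polar fraction."""
    polar_fraction = np.mean(subtypes == POLAR)
    return 1.0 / (1.0 + np.exp(-(polar_fraction - 0.5) * 4.0))

def permutation_effect(model, samples, new_label):
    """Mean change in prediction after setting every subtype label to new_label."""
    deltas = []
    for subtypes in samples:
        flipped = np.full_like(subtypes, new_label)   # spatial positions unchanged
        deltas.append(model(flipped) - model(subtypes))
    return float(np.mean(deltas))

# Synthetic samples: each is an array of subtype labels for 50 immune cells.
rng = np.random.default_rng(0)
samples = [rng.integers(0, 3, size=50) for _ in range(20)]

effect_to_polar = permutation_effect(toy_model, samples, POLAR)
effect_to_uniform = permutation_effect(toy_model, samples, UNIFORM)
```

With a real model, the same loop would be run on each sample graph while keeping node positions and edges fixed, so that any change in prediction is attributable to the subtype labels alone.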


3. Methods 
3.1. Datasets 


Our primary dataset consists of 308 samples from 81 patients with head and neck squamous cell 
carcinomas at the University of Pittsburgh Medical Center (UPMC-HNC). Two external validation 
datasets are used: a colorectal cancer dataset with 292 samples from 161 patients from Stanford
University (Stanford-CRC) to demonstrate generalization to another disease; and a head and neck
squamous cell carcinomas dataset with 112 samples from 29 patients from Dana Farber Cancer 
Institute (DFCI-HNC) to demonstrate generalization to an additional clinical site. The number of 
samples, patients, coverslips, and total cells in each dataset is described in Supp. Table 5A. 
Phenotype annotations for UPMC-HNC are described in Supp. Table 5B. Full CODEX data 
acquisition and preparation details are described in Supp. Methods. UPMC-HNC is chosen as the 
primary training and evaluation dataset as it contains the largest number of samples, coverslips, and 
total cells. We evaluate our models on held-out coverslips not seen during training to assess model 
robustness to technical artifacts across coverslips. 


The UPMC-HNC and Stanford-CRC datasets have one held-out coverslip for model 
validation and one held-out coverslip for model evaluation. The Stanford-CRC dataset has half of 
one coverslip randomly split and held out for model validation and one held out for model 
evaluation. The DFCI-HNC dataset has one coverslip randomly split by patients for model
evaluation.


3.2. Biomarker expression preprocessing


Single-cell expression was computed for each biomarker by 1. applying a deep learning cell 
segmentation algorithm (DeepCell) on the DAPI biomarker channel (nuclear stain) to obtain 
nuclear segmentation masks; 2. successively dilating segmentation masks by flipping pixels each 
time with a probability equal to the fraction of positive neighboring pixels (repeated 9 times); 3. 
computing the mean expression value across pixels within the single cell; and 4. normalizing the 
expression values across all cells in a sample using quantile normalization and arcsinh 
transformation followed by a z-score normalization: 

$$\mathrm{zscore}\left(\mathrm{arcsinh}\left(\frac{x}{5\, q_{0.2}(x)}\right)\right)$$

where zscore is defined given μ and σ, the mean and standard deviation across all cell 
expression values in the sample: 

$$\mathrm{zscore}(x) = \frac{x - \mu}{\sigma}$$

x is the vector of a biomarker's values in a sample, arcsinh is the inverse hyperbolic sine 
function, and $q_{0.2}(x)$ is the 20th percentile of x. 
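
The normalization described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the scaling factor of 5 and the linear-interpolation percentile are assumptions, and the population standard deviation is assumed for the z-score.

```python
import math
import statistics


def percentile(values, q):
    """Linear-interpolation percentile of `values` for q in [0, 1] (assumed convention)."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def normalize_biomarker(x, scale=5.0, q=0.2):
    """Return zscore(arcsinh(x / (scale * q-th percentile of x))) for one biomarker.

    `scale` and `q` are assumptions taken from the reconstructed formula.
    """
    denom = scale * percentile(x, q)
    if denom <= 0:
        raise ValueError("the reference percentile must be positive")
    transformed = [math.asinh(v / denom) for v in x]
    mu = statistics.fmean(transformed)
    sigma = statistics.pstdev(transformed)  # population standard deviation (assumption)
    if sigma == 0:
        raise ValueError("constant expression vector cannot be z-scored")
    return [(v - mu) / sigma for v in transformed]
```

Dividing by a low percentile before the arcsinh step puts every biomarker on a comparable scale, so the final z-score is not dominated by markers with a large dynamic range.
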


3.3. Image patch generation 


After preprocessing (tile & cycle alignment, stitching, deconvolution, and background correction) 
CODEX data is available as multichannel OME-TIFF files, with each image channel corresponding 
to the fluorescence signal (expression) of a distinct biomarker probe. To prepare the input image 
patches for the deep learning model, we perform the following: All pixel values for a biomarker in 
a sample are normalized using ImageJ’s AutoAdjust function. 


3.4. Cell type ground truth and predictions 


To produce cell type labels, we first obtained a cells-by-features biomarker expression matrix - for 
each marker, we took the average signal across all pixels in a segmented cell. This matrix was 
normalized and scaled as described above, and then principal component (PC) analysis was 
performed. We constructed a nearest-neighbor graph (k = 30) of cell expression in PC space with 
the top 20 PCs, then performed self-supervised graph clustering on the result. Clusters were 
manually annotated according to their cell biomarker expression patterns. This procedure was 
performed on a subset of 10,000 cells and subsequently used to train a KNN algorithm. This 
algorithm was used to transfer labels to the entire dataset. The cell type labels that were used are: 
Tumor (CD15+, CD20+, CD21+, Ki67+, Podo+, Other), Naive immune cell, Granulocyte, Vessel, 
CD4 T cell, Macrophage, CD8 T cell, Stromal / Fibroblast, APC, Lymph vessel, and B cell. 
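
The label-transfer step can be illustrated with a small k-nearest-neighbor majority vote. This is a sketch under assumptions: the text does not state the value of k used for the KNN classifier or the distance metric, so k = 3 and Euclidean distance in PC space are placeholders, and `knn_transfer` is a hypothetical name.

```python
import math
from collections import Counter


def knn_transfer(train_points, train_labels, query_points, k=3):
    """Assign each query cell the majority label of its k nearest annotated cells.

    Points are coordinate tuples (e.g. top principal components);
    k and the Euclidean metric are assumptions.
    """
    labels = []
    for q in query_points:
        nearest = sorted(
            range(len(train_points)),
            key=lambda i: math.dist(train_points[i], q),
        )[:k]
        votes = Counter(train_labels[i] for i in nearest)
        labels.append(votes.most_common(1)[0][0])
    return labels
```

For example, `knn_transfer(annotated_pcs, annotated_types, all_pcs)` would propagate the manual annotations from the 10,000-cell subset to every cell in the dataset. In practice a spatial index would replace the brute-force sort.
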


3.5. Calculating polarity score 


Our polarity measurement methodology is described in Figure 1. The segmentations are used to 
calculate cell center coordinates. For each cell, a 40px square patch is extracted around the center 
pixel. Several de-noising steps are first taken: 1. low/background values are zeroed out (values < 
0.1), 2. biomarker expressions that spatially overlap with the DAPI channel are subtracted out in 
both the center and neighborhood cells. 


We then transform the patch from cartesian coordinates to polar coordinates using the scikit-image 
package (skimage.transform.warp_polar). The polar image is then summed along the y-axis, 
producing a 1-dimensional vector. An additional refining step is taken where cells are assigned 
‘other’ if the vector 1. sums to 0 either along the x- or y-axis, 2. does not contain multiple unique 
values, or 3. has a mean less than 0.02. Finally, the vector is normalized within a [0,1] range and 
sorted in ascending order, and a score is computed by subtracting the AUC of the sorted vector 
from 1. 
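
The final scoring step can be sketched as follows, starting from the 1-dimensional vector obtained after the polar transform and summation. This is an illustrative sketch, not the authors' code: `polarity_score` is a hypothetical name, the AUC is assumed to be a trapezoidal integral over an x-axis rescaled to [0, 1], and the check on zero sums of the polar image itself is omitted because it needs the 2-D patch.

```python
def polarity_score(profile, min_mean=0.02):
    """Polarity score of a 1-D angular expression profile.

    Returns None when the cell should be assigned 'other'
    (no variation in the profile, or mean below `min_mean`).
    """
    if len(set(profile)) < 2:
        return None  # does not contain multiple unique values
    if sum(profile) / len(profile) < min_mean:
        return None  # signal too weak
    lo, hi = min(profile), max(profile)
    # Min-max normalize to [0, 1] and sort in ascending order.
    norm = sorted((v - lo) / (hi - lo) for v in profile)
    # Trapezoidal AUC with the x-axis rescaled to [0, 1] (assumption).
    step = 1.0 / (len(norm) - 1)
    auc = sum((a + b) / 2.0 * step for a, b in zip(norm, norm[1:]))
    return 1.0 - auc
```

With this convention a profile whose signal is concentrated in one angular bin, such as `[0, 0, 0, 0, 1]`, scores higher (0.875) than a linear ramp such as `[0, 1, 2, 3, 4]` (0.5), which matches the intent that polarized expression yields a score close to 1.
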


On its own, the polarity score is difficult to interpret and incorporate into existing analysis 
pipelines that rely on discrete cell phenotypes. Thus, we define three cell subtypes based on the 
polarity score value: uniform (cells with a polarity score below a threshold), polar (cells with a 
polarity score above a threshold), and other (cells that fall in between both thresholds). To obtain 
optimal thresholds for defining polar and uniform cell types from the polarity scores, we perform a 
two-dimensional grid search on the MLP model and select the pair of values that yielded the 
highest validation AUC score in the survival prediction task. From this process, we obtain 0.94 and 
0.8 as the polar and uniform threshold cutoffs, respectively (Figure 1B). Figure 1C shows the 
polarity distributions after thresholding. 
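
The thresholding rule can be written down directly. The sketch below uses the reported cutoffs of 0.94 (polar) and 0.8 (uniform); the function name and the handling of scores exactly equal to a cutoff are assumptions.

```python
def polarity_subtype(score, polar_cutoff=0.94, uniform_cutoff=0.8):
    """Map a polarity score to 'polar', 'uniform', or 'other'.

    A score of None (cell filtered out during scoring) is treated as 'other'.
    """
    if score is None:
        return "other"
    if score > polar_cutoff:
        return "polar"
    if score < uniform_cutoff:
        return "uniform"
    return "other"
```

In the grid search described above, this function would be re-evaluated for each candidate pair of cutoffs before recomputing the cell type composition and the validation AUC.
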


These thresholds are used to define six new cell subtypes for polar and uniform CD8 T 
cells, CD4 T cells, and B cells. CD8, CD4, and CD20 were used as the representative surface 
biomarkers for each of the three cell types, respectively. These three cell types and biomarkers were 
chosen as they are known in the literature to exhibit polar expression during engagement. 


3.6. Machine learning models 


We train two machine learning models to evaluate the benefit of including the six newly defined 
cell subtypes. First, we use a 3-layer multilayer perceptron (MLP) neural network that accepts the 
percent composition of cell types per sample and predicts binary 60-month survival. Each layer 
contains 256 nodes followed by a LeakyReLU activation function. Each model is trained with 
binary cross-entropy loss across 200 epochs and a learning rate of 0.001. 
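
The MLP input is a per-sample vector of cell type percentages. A minimal sketch of how such a feature vector could be built is shown below; the function name and the example labels for the new subtypes (e.g. "polar CD8 T cell") are hypothetical, and the network itself is not reproduced here.

```python
from collections import Counter


def composition_features(cell_labels, cell_types):
    """Percent composition of each cell type in one sample.

    `cell_labels` holds one label per cell; `cell_types` fixes the feature order
    (baseline types first, then any polar/uniform subtypes).
    """
    if not cell_labels:
        raise ValueError("sample contains no cells")
    counts = Counter(cell_labels)
    total = len(cell_labels)
    return [100.0 * counts[t] / total for t in cell_types]
```

Under this scheme the baseline and extended models differ only in the length of `cell_types`: the extended model appends the six polar and uniform subtypes to the original list.
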


Second, we train a graph-based neural network (GNN) that takes as input 3-hop cell 
neighborhoods and predicts neighborhood-level survival status (Figure 2A). This model transforms 
the structure of each sample into a graph network, where cells are connected by edges to 
neighboring cells. It then pools information about the neighboring cells’ cell types to output an 
outcome probability score for each cell. The sample predictions are generated by averaging the 
scores across all cells in that sample. Model training details follow the procedures described in Wu 
et al. 
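
Two pieces of this pipeline are simple enough to sketch without the GNN itself: extracting a 3-hop neighborhood around a cell, and averaging per-cell scores into a sample-level prediction. The adjacency-dictionary representation and both function names are assumptions made for illustration.

```python
from collections import deque


def k_hop_neighborhood(adjacency, center, hops=3):
    """Return the set of cells within `hops` edges of `center` (breadth-first search).

    `adjacency` maps each cell id to an iterable of neighboring cell ids.
    """
    distance = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if distance[node] == hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return set(distance)


def sample_prediction(cell_scores):
    """Sample-level score: mean of the per-cell outcome probabilities."""
    return sum(cell_scores) / len(cell_scores)
```

On a chain of cells 0-1-2-3-4-5, the 3-hop neighborhood of cell 0 contains cells 0 through 3, which is the subgraph the model would pool over for that cell.
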


Each of the models is trained first on the original 16 cell types (baseline) and then trained 
using the 6 additional cell types. In both of these settings, each model is trained and evaluated on 
the UPMC-HNC and Stanford-CRC datasets. An additional evaluation is performed on the 
DFCI-HNC dataset using the UPMC-HNC trained model. 


3.7. Permutation experiments 


To assess the effect of polar/uniform cell types on the GNN model’s survival predictions, we 
perform several permutation experiments. In the first experiment (Figure 3A), we flip the cell type 
label of all immune cells to either polar or uniform and evaluate the predicted survival probability 
in each scenario. In the second experiment, we sample random subgraphs containing immune cells 
and tumor cells for CD4 T cells, CD8 T cells, and B cells. Then, we flip all immune cells to either 
polar or uniform and perform a spatial permutation, where we shuffle the immune cells into either 
clustered (where all immune cells are neighbors) or dispersed (where immune cells are randomly 
located in the subgraph) orientations and evaluate the predicted probabilities for each orientation. 
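
The label-flipping and dispersal steps can be sketched as below. This is an illustration only: the label strings (e.g. "polar CD8 T cell"), the set of lineages treated as immune cells, and the function names are assumptions, and the clustered arrangement (which depends on the graph structure) is not shown.

```python
import random

# Assumption: the three lineages for which polar/uniform subtypes are defined.
IMMUNE_LINEAGES = ("CD8 T cell", "CD4 T cell", "B cell")


def base_type(label):
    """Strip a leading 'polar ' or 'uniform ' prefix from a cell type label."""
    for prefix in ("polar ", "uniform "):
        if label.startswith(prefix):
            return label[len(prefix):]
    return label


def flip_immune_labels(labels, target):
    """Relabel every immune cell as `target` ('polar' or 'uniform')."""
    if target not in ("polar", "uniform"):
        raise ValueError("target must be 'polar' or 'uniform'")
    return [
        f"{target} {base_type(label)}" if base_type(label) in IMMUNE_LINEAGES else label
        for label in labels
    ]


def disperse(labels, seed=0):
    """Randomly reassign labels to node positions (dispersed arrangement)."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

Each permuted label list would then be fed back through the trained GNN, and the predicted survival probabilities compared between scenarios.
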


4. Discussion 


We describe a robust, interpretable subcellular morphology metric that reflects macro-biological 
states. Although our results do not conclude that these polarity events definitively quantify immune 
synapses, they do suggest that such measurements represent biologically relevant signals in tumor 
microenvironments. Though our described method is performed on CODEX data, it can similarly 
be applied to other lower-plexed imaging techniques like IHC that include a nuclear marker (e.g., 
DAPI) and one or more surface biomarkers. 


To date, there has been no consensus that biological events like 
engagement and activation or exhaustion can be reliably observed at the standard resolution of mIF 
imaging. Potential confounders include bleedover, sample slicing artifacts, measurement noise, cell 
size, and density of neighboring cells. We address these by conducting several negative control 
experiments and find that these factors alone do not adequately explain the signal present in our 
polarity measurements. 


One counter-hypothesis is that polarity measurements serve as a proxy for neighborhood 
information -- i.e. the presence of certain cell types or spatial arrangements. Another possibility is 
that they are primarily an imaging artifact (for instance, irregular borders due to slicing). To test 
these, we trained a GNN that incorporates local neighborhood information into its predictions and 
then introduced the polar and uniform cell types. The results show that the new cell types improve 
performance even in models that have access to cell neighborhood information, suggesting that they 
introduce information beyond the neighborhood cell type composition or spatial arrangement of 
cells. 


Further experimental evidence is required to define these observations as a specific 
biological phenomenon, i.e. T cell engagement. However, we believe that this work provides 
evidence of the importance of measuring and incorporating subcellular polarity information into 
tissue microenvironment analyses, and represents an important step toward a personalized 
understanding of disease states, drug response, and patient prognosis. 


Supplementary Materials: All supplementary tables, figures, and data are available at: 
https://docs.google.com/document/d/In97PEC2kq4 1fNOWMXrOASyd42DZ0_nUapnEs6HoeJ8c 

Code Availability: Code for replicating the experiments in this paper is available in this code 
repository: https://gitlab.com/enable-medicine-public/polarity 


KombOver: Efficient k-core and K-truss based characterization of perturbations 
within the human gut microbiome 

The microbes present in the human gastrointestinal tract are regularly linked to human 
health and disease outcomes. Thanks to technological and methodological advances in recent 
years, metagenomic sequencing data, and computational methods designed to analyze 
metagenomic data, have contributed to improved understanding of the link between the 
human gut microbiome and disease. However, while numerous methods have been recently 
developed to extract quantitative and qualitative results from host-associated microbiome 
data, improved computational tools are still needed to track microbiome dynamics with 
short-read sequencing data. Previously we have proposed KOMB as a de novo tool for 
identifying copy number variations in metagenomes for characterizing microbial genome 
dynamics in response to perturbations. In this work, we present KombOver (KO), which 
includes four key contributions with respect to our previous work: (i) it scales to large 
microbiome study cohorts, (ii) it includes both k-core and K-truss based analysis, (iii) 
we provide the foundation of a theoretical understanding of the relation between various 
graph-based metagenome representations, and (iv) we provide an improved user experience 
with easier-to-run code and more descriptive outputs/results. To highlight the aforementioned 
benefits, we applied KO to nearly 1000 human microbiome samples, requiring less 
than 10 minutes and 10 GB RAM per sample to process these data. Furthermore, we 
highlight how graph-based approaches such as k-core and K-truss can be informative for 
pinpointing microbial community dynamics within a myalgic encephalomyelitis/chronic fatigue 
syndrome (ME/CFS) cohort. KO is open source and available for download/use at: 
https://github.com/treangenlab/komb 


Keywords: metagenomics; graph based methods; anomaly detection. 


1. Introduction 


Metagenomics, the study of the genomes of microbes that inhabit a microbiome, offers an 
unprecedented and highly granular view into the interaction between host-associated micro- 
biomes and host disease phenotypes. Numerous computational tools now exist to uncover the 
taxonomic composition and functional profiles of human host associated microbiomes [1-4]. Of 
particular relevance to this work, higher taxonomic and functional diversity of the microbiota 
is associated with healthy individuals, while lower diversity correlates with disease states [5-8]. 
Furthermore, with the growing number of metagenome assembled genomes (MAGs) [9, 10] the 
association between the genomic composition of microbial communities and the host health 


© 2023 The Authors. Open Access chapter published by World Scientific Publishing Company and 
distributed under the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC) 
4.0 License. 
has become better quantified and understood [11, 12]. However, metagenomic assembly from 
short reads remains a challenge in highly repetitive regions of bacterial genomes [13-15], and 
among closely related strains of a given bacterial species [16, 17]. Genomic repeats arising from 
horizontal gene transfer or duplication events have been associated with bacterial adaptation 
and evolution [18-20], functional diversification [20], and pathogenesis [18]. Recent advances 
in long-read sequencing offer a path to resolution of complex inter- and intra-genomic repeats 
in microbial communities [21, 22]. However, limitations in high molecular weight DNA extraction 
[23] and the financial cost of hybrid or high-quality long-read approaches pose a roadblock 
for large scale studies involving long-read sequencing. Additionally, a large existing corpus 
of metagenomic sequencing data consists predominantly of short paired-end reads, thus war- 
ranting the development of novel methods that can better capture and quantify inter- and 
intra-genomic repeat dynamics and flux. 

To address this challenge, we have previously proposed the software KOMB [24] to extract 
high copy number sequences of potential biological significance in the microbial communi- 
ties from the short paired-end read metagenomic sequencing data, expanding on prior ap- 
proaches [25-27]. As the genomic diversity of a bacterial community has been correlated with 
host health, we hypothesize that the corresponding inter- and intra-genomic repeat structures 
can act as a “biomarker” for host health. Our prior work highlighted the ability of KOMB 
to detect shifts in the microbial community associated with antibiotic treatment and bowel 
cleanse, as well as identify associations between observed shifts and key bacterial members of 
pre- and post-FMT bacterial communities. Additionally, similarly to de novo assembly methods, 
KOMB is a database-independent tool, and hence it avoids database biases [28]. However, 
unlike the common assembly approaches [29, 30], KOMB does not simplify the compacted de 
Bruijn graph, thus retaining the diversity originally present in the sequencing data. Furthermore, 
in contrast to k-mer profiling methods [1, 2], KOMB offers a set of genomic sequences 
that can be annotated for downstream analyses. Thus, KOMB bridges the gap between fast 
profiling methods that either require a database or do not yield sequence units that can be 
readily annotated, and computationally expensive assembly-based approaches. 

For the purpose of identifying key sequences in the graph, KOMB employs the graph 
mining concept of k-core decomposition, which iteratively determines densely connected graph 
components. Previously, we had not investigated the set of sequences contained in the core of 
the graph as a whole, only focusing on sequences with high Core-A anomaly score [31] which 
captures deviations in coreness/degree ratios of a vertex. In KombOver (KO), we introduce 
and implement analysis of the maximal K-truss subgraph. Similarly to the vertices of the 
maximal k-core, the vertices of the maximal K-truss have been shown to have strong spreading 
(i.e. centrality) [32] which can be relevant in certain biological contexts as an alternative to 
betweenness centrality measure [25, 33]. 
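
To make the k-core concept concrete, the sketch below computes the coreness of every vertex by repeatedly peeling the vertex of smallest remaining degree. It is an illustration rather than the KOMB/KO implementation: it uses a heap with lazy deletion, which runs in O(|E| log |V|) instead of the linear-time bucket-based algorithm, and the function names are hypothetical.

```python
import heapq


def core_numbers(adjacency):
    """Coreness of each vertex of an undirected graph.

    `adjacency` maps each vertex to the set of its neighbors (no self-loops).
    Vertex ids must be mutually comparable (e.g. all strings or all integers).
    """
    degree = {v: len(neighbors) for v, neighbors in adjacency.items()}
    heap = [(d, v) for v, d in degree.items()]
    heapq.heapify(heap)
    core = {}
    current = 0
    while heap:
        d, v = heapq.heappop(heap)
        if v in core or d != degree[v]:
            continue  # stale heap entry
        current = max(current, d)
        core[v] = current
        for u in adjacency[v]:
            if u not in core:
                degree[u] -= 1
                heapq.heappush(heap, (degree[u], u))
    return core


def maximal_core(adjacency):
    """Vertices whose coreness equals the maximum coreness in the graph."""
    core = core_numbers(adjacency)
    top = max(core.values())
    return {v for v, c in core.items() if c == top}
```

For a triangle with one pendant vertex attached, the three triangle vertices have coreness 2 and the pendant vertex has coreness 1, so the maximal k-core is the triangle. Note that the pendant's neighbor has degree 3 but coreness 2, the kind of degree/coreness mismatch that the Core-A score is described as capturing.
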

One of the limitations of our prior work was its scalability to large metagenomic studies. 
In particular, the construction of the main data structure employed by KOMB, the hybrid 
unitig graph (HUG), incurred a high computational cost. This led to run times of 
over an hour per metagenomic sample, making the computational cost prohibitive 
for studies with thousands to tens of thousands of samples. To address this limitation, in this work we propose 
a set of improvements to the HUG construction and analysis in KO aimed at enabling 
large-scale processing of genomic data and characterization of phenotype-associated dynamics. 
Furthermore, in addition to computational improvements, we provide a more extensive 
characterization of HUGs within the context of bacterial pangenomics, and draw parallels 
between pangenome graphs constructed from MAGs and HUGs. In order to assess our tool, 
we analyzed short-read metagenomic sequencing data from a cohort of controls and patients 
with myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS), previously published by 
Xiong et al. [34]. Additionally, we have benchmarked KO on integrative human microbiome 
project [35] inflammatory bowel disease (IBD) cohort samples, as well as human genome 
sequencing data from the Genome in a Bottle project [36], to demonstrate KO’s scalability to both 
large data volumes and complex repeat architectures. 


2. Results 
2.1. Hybrid unitig graphs 


Hybrid unitig graphs (HUGs) are an extension of compacted de Bruijn graphs [37-39] used 
as a primary data structure in de novo de Bruijn graph assembly approaches [29, 30, 40]. The 
key addition in HUGs is the presence of paired-end edges coming from the sequencing read data. 
Hence, while during assembly the de Bruijn graphs are iteratively simplified to construct 
MAGs [29, 30], in KO denser and more complex HUGs are analyzed directly to facilitate the 
capture of repeat dynamics within microbial communities. Conversely, pangenome graphs are 
typically constructed from annotated genome assemblies, and capture high-level variation in 
synteny and copy numbers of gene clusters across related microbial genomes. In this context, 
HUGs bridge the gap between exact compaction achieved in the compacted de Bruijn graphs 
and high-level genomic variation representation of pangenome graphs [41-43]. 

Thus, compared to compacted de Bruijn graphs (Figure 1c), HUGs offer additional connectivity 
information based on local similarity and inferred adjacency between unitigs. Compared 
to the pangenome graphs, HUGs require neither complete genome assembly nor identification 
of putative gene clusters (Figure 1d) and hence can be constructed more efficiently 
from short paired-end read data. 
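
The two edge types of a HUG can be illustrated with a toy construction. This sketch is an assumption-laden reading of the description above, not the KO implementation: the input formats, the function name, and the rule that a read mapping to several unitigs links each pair of them are all hypothetical, and the local-similarity pairs are taken as precomputed.

```python
from itertools import combinations


def build_hug_edges(similar_pairs, read_pair_mappings):
    """Collect HUG edges as {frozenset({u, v}): set of edge types}.

    similar_pairs: iterable of (unitig, unitig) found to be locally similar.
    read_pair_mappings: iterable of (hits_read1, hits_read2), each a list of
    unitig ids that the corresponding mate of a paired-end read maps to.
    """
    edges = {}

    def add(u, v, kind):
        if u != v:
            edges.setdefault(frozenset((u, v)), set()).add(kind)

    for u, v in similar_pairs:
        add(u, v, "similarity")
    for hits1, hits2 in read_pair_mappings:
        # Adjacency inferred from the two mates of a read pair.
        for u in hits1:
            for v in hits2:
                add(u, v, "adjacency")
        # Adjacency inferred from a single read mapping to several unitigs.
        for hits in (hits1, hits2):
            for u, v in combinations(sorted(hits), 2):
                add(u, v, "adjacency")
    return edges
```

Because no edges are simplified away, the resulting graph is denser than an assembly graph, which is what the k-core and K-truss analyses later exploit.
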


2.2. Analysis of an ME/CFS cohort 


First, we compared the overall distributions of the number of unitigs reconstructed from control 
samples with the ones from patients with short- and long-duration ME/CFS. We observe 
that all samples in the control cohort contain more than 50,000 unitigs per sample, with 85 
out of 92 samples containing between 50,000 and 400,000 unitigs (Figure 2). In contrast, 3 
samples derived from patients with short-term ME/CFS contain fewer than 50,000 unitigs, and 
62 out of 73 samples in this category contain up to 250,000 unitigs (Figure 2). Similarly, the 
data for long-term ME/CFS contains 6 samples with fewer than 50,000 unitigs, and 68 out of 
73 samples fall into the 0 to 300,000 unitigs range (Figure 2). 

Next, we have designated unitigs with Core-A anomaly scores three standard deviations 
(3σ) above their corresponding sample’s mean (μ) as the anomalous unitigs for the corresponding 
samples. We have explored the distribution of degrees (Figure 3A) in the anomalous unitigs 

Fig. 1. (a) A set of 5 unitigs labeled u1, u2, u3, u4, and u5 with the corresponding read mappings of 
r1, r2, and r3. Note that the read r3 maps to both the end of unitig u3 and the start of the unitig u4. 
(b) A HUG corresponding to the unitigs and reads in (a). Edges marked in red are local similarity 
edges, and edges marked in black are adjacency edges arising from the paired reads (r1, r2). The 
magenta edge {u2, u4} is an adjacency edge arising as a result of multi-mapping of a single read. 
(c) A schematic representation of a compacted de Bruijn graph. Dark blue nodes represent k-mers, 
while light blue and red nodes represent unitigs that have been compacted from unambiguous paths. 
(d) A schematic representation of a pangenome graph. Colored blocks represent gene clusters and 
arrows indicate possible paths through the gene sequences as indicated by corresponding genome 
assemblies. 

Fig. 2. Distribution of total unitig counts in HUGs constructed from control (left), ME/CFS short 
duration (center), and ME/CFS long duration (right) subjects' gut microbiome samples. Samples 
corresponding to short ME/CFS condition show lower absolute counts of unitigs, while those corresponding 
to the long ME/CFS are more similar to the controls. Both short and long ME/CFS 
associated samples have fewer representatives with high unitig counts. 


based on the sample type and noted that the overall distributions are skewed to the left for all 
sample types. However, in the range of degrees from 250 to 1250, short-term ME/CFS samples 
exhibit sharper concentration towards the lower degree values than long-term ME/CFS and 
control samples. Additionally, in the 380-500 range of degrees long-term ME/CFS samples 
exhibit a more uniform distribution. The distributions in Figure 3A were tested for statistical 
difference using the Kolmogorov-Smirnov (KS) test. All three pairwise distribution comparisons 
were significant with p-value < 107°. Since the degree of a unitig in an HUG depends on the 
number of locally similar unitigs and its potential genomic adjacencies, this indicates that 
long-term ME/CFS communities have more anomalous highly connected unitigs. 
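
The μ + 3σ rule used above to designate anomalous unitigs within a sample can be sketched as follows. The function name is hypothetical, the population standard deviation is an assumption, and the per-unitig Core-A scores are taken as given rather than recomputed here.

```python
import statistics


def anomalous_unitigs(scores, n_sigma=3.0):
    """Return the unitigs whose anomaly score is at least mean + n_sigma * std.

    `scores` maps unitig id -> anomaly score for a single sample.
    """
    values = list(scores.values())
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        return set()  # no variation, so nothing stands out
    cutoff = mu + n_sigma * sigma
    return {unitig for unitig, score in scores.items() if score >= cutoff}
```

Because the cutoff is computed per sample, the number of anomalous unitigs is comparable across samples with very different graph sizes.
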

We also investigated the distribution of coreness values in the anomalous unitigs grouped 
by condition (Figure 3B). Similarly to the degree distributions, for the low coreness values 
(0-100) all three sample types agree. Analogously, short-term ME/CFS samples also have the 
distribution of the coreness skewed towards lower values. This agrees with the observations 
in Figure 2 and Figure 3A, as the lower overall unitig count (and hence a smaller graph) 
and lower degrees (which provide an upper bound on coreness) would result in lower coreness 
values. Long-term ME/CFS samples have several peaks in the distribution (coreness 180-200, 
280-320) that are not observed in the controls. The distributions in Figure 3B were tested for 
statistical difference using the KS test. All three pairwise distribution comparisons were significant 
with p-value < 107°. Since coreness is a proxy for the level of interconnectedness in a group of 
unitigs, this can indicate the presence of clusters of unitigs corresponding to either a complex 
repeat architecture or a high abundance of closely related organisms. 


[Figure 3. Panel A: distribution of degrees in anomalous unitigs (Core-A >= μ + 3σ), x-axis: Degree. 
Panel B: distribution of coreness in anomalous unitigs (Core-A >= μ + 3σ), x-axis: Coreness. 
Each panel shows Control, ME/CFS (Short), and ME/CFS (Long) samples.] 


Fig. 3. (A) Distribution of the degrees of the unitigs that have Core-A anomaly score above μ + 3σ 
for their corresponding samples. Distributions for unitigs of degrees 0-250 are omitted for clarity. 
We note the samples corresponding to short-term ME/CFS have a distribution more skewed to the 
left. (B) Distribution of the coreness of the unitigs that have Core-A anomaly score above μ + 3σ 
for their corresponding samples. Distributions for unitigs of coreness 0-100 are omitted for clarity. 
Similarly to degree distribution, short-term ME/CFS samples show a skew towards lower coreness 
values. Additionally, long-term ME/CFS samples have more uniform distribution in the 100-200 
range compared to controls and more unitigs in the 250-350 coreness range. 

Next, we investigated the α-diversity at the species and genus level, as well as the overlaps 
between species and genus level classifications based on Kraken 2 [44] predictions for the 
anomalous unitigs in the three groups (Figure 4). We note that at both species and genus level 
the short-term ME/CFS samples exhibit a lower average α-diversity, which can be indicative 
of dysbiosis. Additionally, at both species and genus levels, most taxonomic annotations are 
shared among the three cohorts. Still, the control group has consistently more unique taxa 
identified, further supporting the role of diverse microbial community composition in healthy 
individuals. Similarly, the long-term ME/CFS cohort has more unique taxa than the short-term 
ME/CFS cohort, indicating partial recovery from the dysbiosis. 
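
The α-diversity reported here is a Shannon entropy over the taxonomic labels assigned to a sample's unitigs. A minimal sketch is given below; the function name is hypothetical and the natural logarithm is an assumption, since the base is not stated.

```python
import math
from collections import Counter


def shannon_entropy(taxa):
    """Shannon entropy (natural log) of a list of taxonomic labels.

    Each element of `taxa` is the classification of one unitig,
    e.g. a species or genus name.
    """
    counts = Counter(taxa)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```

A sample whose anomalous unitigs are split evenly between two species has entropy ln 2 (about 0.693), while a sample dominated by a single species has entropy 0, which is the direction of change associated with dysbiosis in the text.
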

We next compared the results for α-diversity and the overlaps obtained from anomalous 
unitigs to the same information computed for the unitigs in the highest K-truss (Figure 5). We 
note that unlike in the case of general anomalous unitigs, those that belong to the highest K-truss 
show more similarity in the α-diversity between control and short-term ME/CFS samples, 
with long-term ME/CFS samples being the outlier (Figure 5A, B). Additionally, the total α-diversity 
in the trusses (Figure 5A, B) is noticeably lower than in general anomalous unitigs 
(Figure 4A, B). This is expected given trusses are densely connected subgraphs of a HUG, and 
hence have higher propensity to represent closely related genomic segments. Higher α-diversity 
in the long-term ME/CFS trusses can be a potential indicator for functional enrichment with 
multiple taxa coding for the same function in the long-term ME/CFS microbiota. Furthermore, 
we observed that the species and genera shared between all three cohorts make up a smaller 
fraction of the total classifications in the trusses than in the anomalous unitigs. Namely, while species shared between 
all three categories make up 27.5% (1808 of 6567) of all species identified in the anomalous 
unitigs of the three cohorts (Figure 4C), they make up only 24.1% (177 of 735) of all species 
identified in the trusses of the samples (Figure 5C). Analogously, the shared genera make up 
55.0% (1008 out of 1834) of all classifications for anomalous unitigs, and only 33.0% (156 out 
of 473) of all classifications for truss unitigs. 

Additionally, when individual KO profiles are visualized for samples matched by age, gender, 
and race (Figure 6), we observe more compact profiles for the disease-associated samples. 
This matches the dysbiosis hypothesis, with the long-term ME/CFS sample showing a more 
complex profile than the short-term one. In all samples shown in Figure 6, unitigs with high 
anomaly scores are the ones for which the degree is larger than the expected coreness. This 
pattern occurs when a unitig is flanked by varying genomic contexts across the metagenome, 
and hence indicates unitigs with high inter- and intra-genomic copy numbers. 


2.3. Computational performance 


The k-core decomposition algorithm runs in O(|V| + |E|) time [45] and hence scales linearly 
with the size of the graph. This scaling is particularly attractive in metagenomic communities, 
where the number of edges |E| is proportional to the number of vertices |V|. In the case of more 
complex repeat architectures, such as Alu repeats in the human genome, the number of edges can 
be proportional to the square of the number of vertices. Compared to Brandes’s algorithm for 
betweenness centrality (a common algorithm for detecting influential nodes in a network) [46], 
the k-core decomposition algorithm is significantly faster. The asymptotic time complexity of 

Fig. 4. (A, B) Distribution of alpha diversity (Shannon entropy) for anomalous unitigs that have 
anomaly score three standard deviations above the mean for the respective sample grouped by the 
condition and duration. p-values from Welch’s t-test for equality of means are displayed above the 
boxplots. (A) Alpha diversity of samples associated with the ME/CFS condition is lower than that of 
control samples. Long ME/CFS samples on the other hand do not appear to be noticeably distinct 
from the control ones. Entropy was calculated based on the unitigs for which Kraken 2 provided 
species-level classification. (B) Alpha diversity of samples associated with the ME/CFS condition 
is lower than that of control samples and long ME/CFS samples. Long ME/CFS samples on the 
other hand do not appear to be noticeably distinct from the control ones. Entropy was calculated 
based on the unitigs for which Kraken 2 provided genus level classification (species annotations are 
rolled up into respective genus). (C) Venn diagram representing intersections between sets of species 
identified in the control, short ME/CFS, and long ME/CFS sample collections. (D) Venn diagram 
representing intersections between sets of genera identified in the control, short ME/CFS, and long 
ME/CFS sample collections. 


Brandes’s algorithm for unweighted graphs is O(|E||V|), which even in the |E| ~ α|V| regime 
leads to O(|V|^2) complexity, compared to O(|V|) for the k-core decomposition. 

The K-truss decomposition has an asymptotic time complexity of O(|E|^1.5) [47], making 
it slower than the k-core decomposition. Nevertheless, since we are only interested in the 
vertices contained in the maximal K-truss, we make the simplification of running the K-truss 
decomposition on only the maximal k-core of the graph, similar to the prior work [32]. 
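
For readers unfamiliar with trusses, the sketch below finds the maximal K-truss of a small graph, using the standard definition that every edge of a K-truss lies in at least K - 2 triangles of the subgraph. It is a naive illustration with hypothetical function names: it repeatedly rescans all edges and does not achieve the O(|E|^1.5) bound, and it is not the KO implementation. In line with the simplification described above, the input could be restricted to the edges of the maximal k-core.

```python
def k_truss(edges, k):
    """Edges of the k-truss: the maximal subgraph in which every edge
    is contained in at least k - 2 triangles."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    changed = True
    while changed:
        changed = False
        for u in list(adj):
            for v in list(adj[u]):
                # Common neighbors of u and v = triangles through edge (u, v).
                if len(adj[u] & adj[v]) < k - 2:
                    adj[u].discard(v)
                    adj[v].discard(u)
                    changed = True
    return {frozenset((u, v)) for u in adj for v in adj[u]}


def maximal_truss(edges):
    """Return (K, edge set) for the largest K with a non-empty K-truss."""
    k = 2
    best = {frozenset(e) for e in edges if e[0] != e[1]}
    while True:
        candidate = k_truss([tuple(e) for e in best], k + 1)
        if not candidate:
            return k, best
        k, best = k + 1, candidate
```

On a complete graph of four vertices with one extra pendant edge, the pendant edge is removed at K = 3 and the remaining six edges form the maximal truss at K = 4.
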

Empirically, in addition to analyzing the ME/CFS data from the Xiong et al. study [34], we 
have also benchmarked KO on the IBD data [35] from the integrative HMP, as well as chromosome 

Fig. 5. (A, B) Distribution of alpha diversity (Shannon entropy) for anomalous unitigs that belong 
to the highest K-truss. p-values from Welch’s t-test for equality of means are displayed above the 
boxplots. (A) Alpha diversity of samples associated with the ME/CFS condition is lower than that of 
control samples. Long ME/CFS samples on the other hand do not appear to be noticeably distinct 
from the control ones. Entropy was calculated based on the unitigs for which Kraken 2 provided 
species level classification. (B) Alpha diversity of samples associated with the ME/CFS condition 
is lower than that of control samples and long ME/CFS samples. Long ME/CFS samples on the 
other hand do not appear to be noticeably distinct from the control ones. Entropy was calculated 
based on the unitigs for which Kraken 2 provided genus level classification (species annotations are 
rolled up into respective genus). (C) Venn diagram representing intersections between sets of species 
identified in the control, short ME/CFS, and long ME/CFS sample collections. (D) Venn diagram 
representing intersections between sets of genera identified in the control, short ME/CFS, and long 
ME/CFS sample collections. 


21 and chromosome 11 aligned reads from human genome HG002 from the Genome in a Bottle 
project [36]. The choice of human genome sequencing data is motivated by highly repetitive 
complex Alu regions present in the genome, hence the regime in which |E] ~ a|V/? is in 
the HUG. All benchmarking was performed on a Ubuntu 18.04.6 LTS system with Intel(R) 
Xeon(R) Gold 5218 CPUs and 312GB of RAM and all runs used 60 threads. The results of 
benchmarking are summarized in Table 1.

The results in Table 1 showcase that the resulting graph edge density is an important component
of the overall computational performance, as indicated by a high run time value for
the HG002 chromosome 11 experiment. Compared to the original KOMB implementation, we 
Fig. 6. Individual KO profiles (color: normalized Core-A anomaly score) for three samples from
the ME/CFS cohort matched by age, gender and race. We observe a more complex profile in the 
control sample, while short- and long-term ME/CFS show compact profiles associated with lower 
bacterial genomic diversity. In all three samples, anomalous unitigs are predominantly high in degree 
compared to their coreness. 


Table 1. Performance of KO on metagenomic and human genome datasets. Total dataset size refers 
to the cumulative size of all data processed, while average sample size describes the mean size of a 
single sample in a dataset. Analogously average runtime refers to mean time to process a single sample, 
while total runtime refers to cumulative time spent analyzing the dataset sequentially. 


Dataset             | # samples | Total dataset size (GB) | Average sample size (GB) | Average wall clock runtime (hrs) | Total wall clock runtime (hrs)
ME/CFS cohort       | 238       | 2,422                   | 10.18                    | 0.11                             | 26.42
iHMP IBD            | 540       | 3,120                   | 5.78                     | 0.19                             | 104.82
HG002 chr21 (300x)  | 1         | 26.17                   | -                        | -                                | 0.71
HG002 chr21 (250bp) | 1         | 5.90                    | -                        | -                                | 0.14
HG002 chr11 (250bp) | 1         | 20.43                   | -                        | -                                | 1.70


achieve up to a 3-fold speedup for metagenomic samples containing an average of 16 million
reads [24]. Additionally, we have performed a head-to-head comparison of KOMB and KO on 
a Zymo mock community sequenced by DOE Joint Genome Institute (BioProject Accession: 
PRJNA699918). We chose the Zymo mock community due to the large sample size (42 GB) 
and relatively simple genomic structure, allowing us to focus on the HUG construction perfor- 
mance, which we identified as a bottleneck, rather than the efficient k-core decomposition part 
of the analysis. On this data KOMB required a total of 7h16m of wall clock time (CPU time: 
273h34m) and 48.28 GB of RAM to produce the final results, while KO required a total of 
1h4m of wall clock time (CPU time: 14h34m) and 156.18 GB of RAM. Note that both versions
were run with 40 threads for this experiment. The speedup is the result of three major changes
in KO: (1) replacement of ABySS [48] with GGCAT [39] for the unitig construction, (2) change 
from BWA MEM to BWA MEM 2 for read mapping, and (3) improved parallelization in the 
KOMB codebase. 


3. Discussion 


In this work, we have provided a set of computational improvements to KOMB implemented 
in KO, and theoretical analysis of connections between HUGs and pangenome graphs. Additionally,
we showcased the usage of KO on an ME/CFS patient cohort and identified disease-associated
patterns. The dynamics inferred from KO profiles and taxa associated with important
unitigs are concordant with the observations from a prior study [34]. Namely, we
observe pronounced dysbiosis in the guts of short-term ME/CFS patients, and partial recovery from
the dysbiosis in the long-term ME/CFS patients. We envision KO as an important tool to
be integrated alongside existing approaches into clinically relevant microbiome studies. The 
key benefits of KO are: (a) selection of a small set of anomalous sequences without relying
on either taxonomy or functional annotation, which can allow de novo analyses of these sequences
and more sensitive detection of perturbations to host microbiome health, and (b) rapid profiling
of a large number of samples, which can aid in the exploration of genotype-phenotype
connections for large study cohorts.

An important next step is extending KO to an integrated approach that can annotate 
unitigs within the graph with associated transcriptomic or metabolomic information. Enriching 
the graph with multi-omic annotations can provide additional context for the nodes identified 
by KOMB as anomalous, enabling further functional associations to be extracted from the 
HUG structures. We believe that by adding -omics annotations KO can be further used to 
select genomic features relevant to the pathology, and hence enable better machine learning 
diagnostic tools. We also plan to add the ability to distinguish between the edge types described
in the Methods section and add the multi-omics annotations to the HUGs to directly extract 
hubs of functionally important genomic regions of a microbiome. 

Additionally, it can be of interest to construct HUGs based on publicly available MAG 
catalogs as an annotation-rich reference for common community patterns identified in pre- 
vious studies. We believe that this integrative large-scale approach can further illuminate 
mechanistic associations between microbiome and disease phenotypes. 


4. Methods 
4.1. Hybrid unitig graph construction 


We begin construction of the HUG by constructing the underlying de Bruijn graph with a 
user-specified k-mer size parameter. The construction is done with GGCAT [39], and the
user can control the parameters exposed by the GGCAT command line interface. GGCAT
produces a FASTA output file containing all maximal non-branching paths through the de
Bruijn graph (unitigs). After the unitigs are constructed, the user has the option to specify an
additional length-based filtering step. We recommend setting this filter to be
equal to the read length.

After construction and filtering, the final set of unitigs becomes the set of vertices in the 
HUG. Next, we perform read mapping of the input paired-end reads to the set of unitigs 
using BWA MEM 2 [49] v2.2.1 with the default parameters. We retain all read mappings for 
constructing the edges of the HUG. An edge is constructed between two unitigs $u_1$ and $u_2$ if
either the same read maps to both of the unitigs, or one read in the pair maps to $u_1$ and the
other read in the pair maps to $u_2$. More precisely, let $r_1$ and $r_2$ be two paired-end reads and
let $M(r_1)$, $M(r_2)$ be the sets of unitigs that $r_1$ and $r_2$ are mapped to. Then the initially empty
set of edges ($E$) in the HUG is united with the set of newly created edges, i.e.
(a) $E \leftarrow E \cup \{\{u, v\} : u \in M(r_i) \text{ and } v \in M(r_i) \setminus \{u\} \text{ for } i = 1, 2\}$
(b) $E \leftarrow E \cup \{\{u, v\} : u \in M(r_1) \text{ and } v \in M(r_2) \setminus \{u\}\}$


Conceptually two kinds of edges arise from this construction: (a) local similarity edges, 
which capture subregions of unitigs that are similar as evidenced by the read mapping, and 
(b) adjacency edges, which capture the potential proximity of two unitigs within a genome. While it is
natural to expect that single read multi-mapping corresponds to similarity edges and paired- 
end information corresponds to the adjacency ones, it is worth noting that single read mapping 
also can contribute to the adjacency edge formation (see Figure [h, b). We currently do not 
distinguish the two edge types (local similarity vs adjacency) in the implementation.
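The two edge rules can be illustrated with a minimal sketch. It assumes the read mappings have already been reduced to, for each read pair, the two sets of unitig identifiers hit by each mate; this input format and the function name `hug_edges` are hypothetical and are not taken from the KO codebase.

```python
from itertools import combinations, product

def hug_edges(read_pairs):
    """Build an undirected edge set from per-pair mapping sets.

    read_pairs: iterable of (M1, M2), where M1 and M2 are the sets of unitig
    identifiers that the two mates of a pair map to (assumed input format).
    """
    edges = set()
    for m1, m2 in read_pairs:
        # Rule (a): unitigs hit by the same read are joined pairwise.
        for mapped in (m1, m2):
            for u, v in combinations(sorted(mapped), 2):
                edges.add(frozenset((u, v)))
        # Rule (b): a unitig hit by mate 1 is joined to every *other*
        # unitig hit by mate 2.
        for u, v in product(m1, m2):
            if u != v:
                edges.add(frozenset((u, v)))
    return edges

# Toy input: two read pairs over four unitigs.
pairs = [({"u1", "u2"}, {"u3"}), ({"u3"}, {"u3", "u4"})]
edges = hug_edges(pairs)
```

Storing each edge as a `frozenset` keeps the graph undirected and free of duplicates, which mirrors the set-union formulation above.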


4.2. k-core decomposition and Core-A anomaly score 


The k-core of a graph is the maximal induced subgraph in which each node has a degree of 
at least k. If the vertices of the k-core of a graph are represented by $V_k$, then the coreness
of vertex v is defined as $\mathrm{coreness}(v) = \max\{k : v \in V_k\}$. Computing the coreness of each
vertex is called k-core decomposition. Once the HUG is constructed, we perform a k-core
decomposition of it using the igraph C library [50] implementation of the linear time
Batagelj-Zaversnik [45] algorithm, which assigns a coreness to each vertex in the HUG. Subsequently,
for each unitig a Core-A anomaly score is computed as specified in previous work on anomaly
detection in networks [31]. In particular, for each vertex v we compute its rank based on the
degree $\mathrm{rank}_d(v)$, and its rank based on coreness $\mathrm{rank}_c(v)$. The Core-A anomaly score is then
defined as the absolute value of the difference of the log of the two ranks, i.e.
$\text{Core-A}(v) = |\log \mathrm{rank}_d(v) - \log \mathrm{rank}_c(v)|$.
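As an illustration of the decomposition and the score, the following sketch peels minimum-degree vertices to obtain coreness and then compares degree and coreness ranks. It is not the igraph implementation used by KO; in particular, the ranking convention (rank 1 for the largest value, ties sharing their average rank) is an assumption made for this example.

```python
import heapq
import math

def coreness(adj):
    """Return {vertex: coreness} by repeatedly peeling a minimum-degree vertex."""
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    heap = [(d, v) for v, d in deg.items()]
    heapq.heapify(heap)
    core, removed, k = {}, set(), 0
    while heap:
        d, v = heapq.heappop(heap)
        if v in removed or d != deg[v]:
            continue  # stale heap entry
        k = max(k, d)
        core[v] = k
        removed.add(v)
        for w in adj[v]:
            if w not in removed:
                deg[w] -= 1
                heapq.heappush(heap, (deg[w], w))
    return core

def descending_ranks(values):
    """Rank 1 = largest value; tied values share their average rank (assumption)."""
    order = sorted(values, key=values.get, reverse=True)
    ranks, i = {}, 0
    while i < len(order):
        j = i
        while j < len(order) and values[order[j]] == values[order[i]]:
            j += 1
        avg = (i + 1 + j) / 2  # mean of the ranks i+1 .. j
        for v in order[i:j]:
            ranks[v] = avg
        i = j
    return ranks

def core_a(adj):
    """Core-A(v) = |log rank_d(v) - log rank_c(v)|."""
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    core = coreness(adj)
    rd, rc = descending_ranks(deg), descending_ranks(core)
    return {v: abs(math.log(rd[v]) - math.log(rc[v])) for v in adj}

# Toy graph: a 4-clique {a, b, c, d} and a hub h joined to a and to five leaves.
adj = {v: set() for v in ["a", "b", "c", "d", "h", "x1", "x2", "x3", "x4", "x5"]}
def link(u, v):
    adj[u].add(v)
    adj[v].add(u)
for u, v in [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d")]:
    link(u, v)
link("h", "a")
for leaf in ["x1", "x2", "x3", "x4", "x5"]:
    link("h", leaf)

core = coreness(adj)
scores = core_a(adj)
```

In this toy graph the hub `h` has the highest degree but coreness 1, so it receives the largest Core-A score, which matches the notion of a vertex whose degree is high relative to its coreness.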

There are two key groups of unitigs with high anomaly scores: (a) individual anomalies 
and (b) anomalous clusters. In general, for any vertex v the shell number is upper bounded 
by the degree of that vertex. Thus, individual anomalies are nodes with a large discrepancy 
between their degree and coreness. In particular, this can be described by the individual
influence value $ii$, defined as $ii = 1 - \mathrm{coreness}(v)/\deg(v)$, which is equal to 0 if the degree and
shell number are equal, and approaches 1 for values of degree significantly larger than that 
of coreness. Individual anomalies are unitigs likely to have varying genomic contexts in the 
metagenome. Thus, individual anomalies are good candidates for mobile genetic elements or 
duplicated genes. Anomalous clusters on the other hand are more likely to arise due to shared 
local similarities between a large group of unitigs. Those can be nearly identical repeats, such 
as Alu elements in human genomes, or hypervariable regions of ribosomal proteins in bacterial 
genomes. 


4.3. K-truss computation 


A K-truss of a graph is an induced subgraph in which every edge is present in at least
$K - 2$ triangles. A method proposed by Malliaros et al. [32] computes the maximal K-truss
by computing it for the k-core of the graph, since the K-truss of a graph is always a subgraph
of its $(K - 1)$-core. Thus, as the k-core decomposition of the HUG is computed, we select the
k-core subgraph of the HUG and then compute its K-truss decomposition using the igraph
C library's implementation of Wang and Cheng's algorithm [47]. The algorithm assigns a
trussness value to each edge in the subgraph, representing the maximum value of K for which 
the edge is present in the K-truss. 

Now, let $V_k$ be the set of nodes and let $E_k$ be the set of edges of the maximal k-core subgraph
of the HUG, and define $\tau : E_k \to \mathbb{N}$ to be the mapping realized by the igraph algorithm, whose
time complexity is $O(|E_k|^{1.5})$. We then set $K = \max\{\tau(e) : e \in E_k\}$ and select the vertices (i.e.
unitigs) in the maximal K-truss to be those in $\{v \in V_k : \tau(e) = K \text{ for some } e \text{ incident to } v\}$.
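The selection of the maximal K-truss vertices can be sketched with a simple support-peeling procedure. This is a didactic version, not the Wang and Cheng algorithm as implemented in igraph, and for brevity it runs on the whole toy graph rather than on the maximal k-core; the function names are hypothetical.

```python
def edge_trussness(adj):
    """Return {frozenset({u, v}): trussness} using support-based peeling."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # working copy
    # Support of an edge = number of triangles that contain it.
    support = {frozenset((u, v)): len(adj[u] & adj[v])
               for u in adj for v in adj[u] if u < v}
    truss, k = {}, 2
    while support:
        # Edges with support <= k - 2 cannot belong to the (k + 1)-truss.
        low = [e for e, s in support.items() if s <= k - 2]
        if not low:
            k += 1
            continue
        for e in low:
            u, v = tuple(e)
            # Removing e destroys every triangle that contains it.
            for w in adj[u] & adj[v]:
                support[frozenset((u, w))] -= 1
                support[frozenset((v, w))] -= 1
            adj[u].discard(v)
            adj[v].discard(u)
            truss[e] = k
            del support[e]
    return truss

def max_truss_vertices(adj):
    """Return (K, vertices incident to at least one edge of trussness K)."""
    truss = edge_trussness(adj)
    K = max(truss.values())
    return K, {v for e, t in truss.items() if t == K for v in e}

# Toy graph: a 4-clique {a, b, c, d}, a triangle {d, e, f}, and a pendant edge a-g.
toy_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d"),
             ("d", "e"), ("e", "f"), ("d", "f"), ("a", "g")]
graph = {}
for u, v in toy_edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)

K, top_vertices = max_truss_vertices(graph)
```

Here the clique edges each lie in two triangles and form the 4-truss, the triangle edges have trussness 3, and the pendant edge has trussness 2.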


4.4. Taxonomic classification and α-diversity calculations


Taxonomic classification of the unitigs was performed with Kraken 2 [2] with the standard
parameters ($k = 35$, $\ell = 31$) and the standard Kraken 2 database consisting of RefSeq viral,
bacterial, and archaeal genomes, as well as human genome and known vector sequences from
UniVec_Core. For α-diversity computations, the unclassified portion of unitigs was discarded,
and the remaining fractions were re-normalized to add up to 1. The α-diversity was defined
as the Shannon entropy of the classified unitig fractions, $H = -\sum_{i} f_i \log f_i$, where $f_i$ is the
fraction of unitigs classified as taxon $i$.
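The entropy computation, including the removal of unclassified unitigs and the re-normalization step, can be sketched as follows. The input format (a mapping from taxon label to unitig count), the label `"unclassified"`, and the use of the natural logarithm are assumptions made for this example.

```python
import math

def alpha_diversity(counts, unclassified_key="unclassified"):
    """Shannon entropy of classified unitig fractions.

    counts: {taxon label: number of unitigs}; the unclassified bucket is
    dropped and the remaining fractions are re-normalized to sum to 1.
    """
    kept = [c for taxon, c in counts.items()
            if taxon != unclassified_key and c > 0]
    total = sum(kept)
    if total == 0:
        return 0.0  # nothing classified: entropy defined as 0 here (assumption)
    return -sum((c / total) * math.log(c / total) for c in kept)

# Two equally abundant taxa after discarding the unclassified portion.
h = alpha_diversity({"Bacteroides": 2, "Faecalibacterium": 2, "unclassified": 6})
```

With two equally abundant taxa the classified fractions are 0.5 each, so the entropy equals log 2 regardless of how many unitigs were unclassified.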


5. Data availability 


This work has not produced any new sequencing data, and relied on publicly available datasets. 
Details for accessing these datasets are specified below. 

ME/CFS metagenomic sequencing data. Illumina short paired-end sequences 
(150bp) from stool samples of 92 controls, 73 short-term ME/CFS, and 73 long-term ME/CFS 
patients were analyzed [34]. Original sequencing data was deposited into SRA under BioPro- 
ject accession PRJNA878603. 

IBD data from integrative HMP. Illumina short paired-end sequences from stool 
samples of patients with IBD were analyzed [35]. We analyzed a subset of 540 out of 1,613
available samples. Data is available from the HMP portal (https://portal.hmpdacc.org/) via 
study IBDMDB. 

Human genome dataset. We have used Illumina short paired-end reads (150bp
and 250bp) from the Genome in a Bottle project. We used aligned reads for the HG002 genome
that can be accessed via the index hosted on GitHub:
https://github.com/genome-in-a-bottle/giab_data_indexes


6. Code availability 


KOMB source code is publicly available on GitHub: https://github.com/treangenlab/komb 
Abstract. Advances in molecular characterization have reshaped our understanding of low-grade
glioma (LGG) subtypes, emphasizing the need for comprehensive classification beyond histology. Lever- 
aging this, we present a novel approach, network-based Subnetwork Enumeration, and Analysis (nSEA), 
to identify distinct LGG patient groups based on dysregulated molecular pathways. Using gene expres- 
sion profiles from 516 patients and a protein-protein interaction network we generated 25 million sub- 
networks. Through our unsupervised bottom-up approach, we selected 92 subnetworks that categorized 
LGG patients into five groups. Notably, a new LGG patient group with a lack of mutations in EGFR, 
NF1, and PTEN emerged as a previously unidentified patient subgroup with unique clinical features 
and subnetwork states. Validation of the patient groups on an independent dataset demonstrated the 
robustness of our approach and revealed consistent survival traits across different patient populations. 
This study offers a comprehensive molecular classification of LGG, providing insights beyond tradi- 
tional genetic markers. By integrating network analysis with patient clustering, we unveil a previously 
overlooked patient subgroup with potential implications for prognosis and treatment strategies. Our 
approach sheds light on the synergistic nature of driver genes and highlights the biological relevance of 
the identified subnetworks. With broad implications for glioma research, our findings pave the way for 
further investigations into the mechanistic underpinnings of LGG subtypes and their clinical relevance. 
Availability: Source code and supplementary data are available at https://github.com/bebeklab/nSEA 


Keywords: Cancer Systems Biology - Network Analysis - Protein-protein Interaction Networks. 


1 Introduction 


Lower-grade gliomas (LGG) are brain neoplasms classified into 3 grades by the World Health Organization 
(WHO), where grades 2 and 3 present an infiltrative phenotype. While some LGGs remain stable, others 
progress to grade 4 gliomas (grade 4 astrocytoma [IDH-mutant tumors] and glioblastoma [IDH-wildtype
tumors]), resulting in survival ranges between 1 and 15 years. Common treatment options include resection, 
chemotherapy, and radiation therapy. Based on the origin of glial cells, LGG can be classified into two sub- 
types: astrocytomas and oligodendrogliomas. Molecular features are also associated with clinical outcomes; 
for example, LGG with both an IDH mutation (IDH1 or IDH2) and deletion of chromosome arms 1p 
and 19q (1p/19q codeletion) show a better response to radiochemotherapy and are associated with longer 


© 2023 The Authors. Open Access chapter published by World Scientific Publishing Company and 
distributed under the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC) 
4.0 License. 
survival. However, neither grade-based stratification nor molecular features can fully capture the complex
architecture of LGG. 


Gliomas are histopathologically classified into four grades, with higher grades associated with a worse prognosis. While this
classification has prognostic value, investigating the complex molecular alterations within gliomas can lead to 
a better understanding of the biology behind the tumor types. For instance, some low-grade gliomas behave 
like malignant glioblastoma, while others have a favorable outcome similar to low-grade gliomas. Identifying 
genetic and epigenetic alterations in these tumors can reveal biomarkers with both prognostic value and the 
potential to guide therapeutic decisions [1]. 


Recently, studies by The Cancer Genome Atlas (TCGA) on lower-grade diffuse gliomas defined disease 
classification based on genetic and epigenetic alterations, providing biological justification for the utility of 
these features over histologic ones. Integrated genome-wide data analysis from multiple platforms delineated 
three molecular classes of lower-grade gliomas that were more concordant with IDH, 1p/19q, and TP53 
status than with histologic classes [2]. 


In recent years, various approaches have been proposed for finding disease-related sub-networks [3-7] or 
predicting disease-causing genes [8-11] from large knowledge bases, such as protein-protein interaction (PPI) 
networks or signaling pathway databases. Most of these methods integrate systems-level measurements of 
gene and/or protein expression to prioritize networks [12-17]. A scoring function is combined with a search 
strategy to evaluate identified sub-networks. However, since finding sub-networks is an NP-hard problem [12], 
long run times and sub-optimal solutions are major drawbacks of these applications. Among all applicable 
methods, kernel clustering, modularity optimization, random-walk-based, and local network search methods
outperform others [6]. While some of these approaches can identify prognostic modules or disease-relevant
pathways [6, 12, 18], they lack the ability to prioritize modules for disease subtype identification and subsequent
survival analyses.


Enrichment-based pathway analyses are also commonly used to identify biological functions related 
to biomarkers and study disease subtypes in cancer [19-21]. However, since such approaches depend on 
previously selected genes, these analyses may lead to biased results. For instance, Sanchez-Vega et al. [22] 
analyzed the mechanisms and patterns of somatic alterations in ten canonical pathways and mapped them to 
multiple tumor types to discover pan-cancer subtypes and link them to possible drug targets. This supervised 
approach easily captured known subtypes with known disease pathways. In contrast, Durmaz et al. [23] 
reported an unsupervised approach that repeated this identification process using frequent subgraph mining 
with sampling and identified 106 clusters from 43K sub-networks mined from patient-specific networks. 
However, the former approach lacks the freedom to discover new subtypes, while the latter randomized 
approach requires careful filtering and repeated trials to arrive at robust discoveries. 


In this paper, we introduce a novel network analysis algorithm known as the n-Node Subnetwork Enu- 
meration Algorithm (nSEA). Our aim is to address challenges encountered by disease classification methods, 
which often rely on disease-associated genes or subnetworks for patient characterization and prognostics. 
Here, we discern robust patient subtypes based on functional variations in gene/protein expression within 
each sample and their interactions. This approach enables us to establish a patient classification framework 
that not only enhances prognostic accuracy but also elucidates the distinct pathway-level differences among 
patient subgroups. Such an approach holds the promise of improved prognostication for future patients, 
along with opportunities for enhanced treatment strategies and personalized interventions. 


The nSEA algorithm takes a protein-protein interaction (PPI) network and system-level measurements
of gene expression profiles as inputs. The goal of nSEA is to identify differentiating patterns among
disease samples in an unsupervised manner. The algorithm is based on a bottom-up methodology in which 
a large sparse biological network (a PPI network filtered by patient gene expression profiles) is exhaustively 
enumerated and decomposed into n-node sub-networks (Figure 1A and 1B). These sub-networks are then 
evaluated, ranked, and filtered based on their inner-pattern consistency and network topology (Figure 1C). 
In simple terms, the presented method aims to exhaustively identify n-node sub-networks that exhibit con- 
sistent expression patterns of network edges, quantified by the delta of gene expressions. The selected n-node 
sub-networks are expanded to include their neighboring nodes, forming more stable network structures (Fig- 
ure 1D). By applying principal component analysis to network states, we identified sub-networks capable 
of discriminating disease states (Figure 2A-E) [24,25]. The final set of sub-networks represents the major 
dynamics in the PPI network and provides a global picture of pathway dysfunction across cancer subtypes. 
Fig. 1: Diagram of the nSEA algorithm. The algorithm takes a protein-protein interaction (PPI) network
and gene expression profiles of samples as inputs. (A) The PPI network is converted into a sparse network. 
Edges are filtered based on the expression difference of their corresponding node pairs. (B) Network enu- 
meration concept: All possible 4-node sub-networks are extracted from the original network, forming a list. 
Letters represent proteins. Three 4-node sub-networks and their positions in the list are annotated in colors 
as examples. (C) Feature selection based on the sub-network list. Sub-networks are ranked according to 
their inner-pattern consistency in a decreasing manner. They are then scanned and tested for topology (not 
shown in the diagram) from top to bottom. If a sub-network is selected into the feature set, it will exclude 
other sub-networks that share any node with it. (D) Selected sub-networks are expanded to neighboring 
nodes that share similar patterns, forming larger sub-networks. Solid lines represent edges at the current 
step, while dashed lines represent potential edges that can be added during expansion. Non-significant edges 
are omitted in this figure. (E) Specific application of nSEA to lower-grade gliomas (this study). Data is
represented by a square and the process is represented by a "squircle." The basic properties of the data
between each step are also annotated.


We applied nSEA to LGG samples and identified 5 latent groups/subtypes. We compared our subtypes
with the current classification and identified significant sub-networks related to our clustering. We also
explored the mutation, copy-number variation, and methylation features driving this classification
and discussed several hypotheses based on these results. Furthermore, we compared our method with
existing disease classification methods and validated our classification using an independent LGG cohort. 


2 Methods 


2.1 nSEA algorithm 


The nSEA algorithm is based on a bottom-up methodology with which a large sparse biological network, 
G(V, E), is enumerated and decomposed into n-node subnetworks exhaustively. The goal of the algorithm 
is to identify subnetworks that can classify patients into subgroups and also provide distinctive biological 
states for each patient group based on these subnetworks. The first step is to create a network that is 
sparse enough for further processing. The PPI networks available today are too large for any enumeration 
algorithm to complete in a reasonable time. We create a sparse network to speed up the process while 
preserving relevance to disease classification by utilizing gene expression profiles. This is accomplished by 
using a protein-protein interaction (PPI) network and system-level measurements of gene expression profiles 
as inputs. Since the subnetwork vector we will calculate in the next steps represents the first principal 
component or the largest variance of the expression values within the subnetwork, edge filtration should also 
facilitate achieving this (see a toy example of how this vector is generated in Section S1.1). Let $e \in E$ and
$v \in V$ of the PPI network $G = (V, E)$. We define an edge score $S_{e_k}$ between nodes (genes/proteins) $v_i$ and
$v_j$ as:

$$S_{e_k} = \sigma(g_{v_i} - g_{v_j}), \quad e_k = (v_i, v_j), \quad i > j \qquad (1)$$


where $\sigma$ is the standard deviation and $g$ is the expression vector of the gene (Figure 2). Edge filtration was
done by selecting the top 5% of edges ranked by the edge score $S_{e_k}$.
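A minimal sketch of the edge scoring and filtration step is given below. It assumes expression vectors are stored as equal-length lists keyed by gene, uses the sample standard deviation, and rounds the number of retained edges up while keeping at least one; these choices and the function names are assumptions for illustration and are not taken from the nSEA source code.

```python
import math
import statistics

def edge_scores(edges, expr):
    """Score each edge by the standard deviation of the expression difference."""
    return {
        (a, b): statistics.stdev([x - y for x, y in zip(expr[a], expr[b])])
        for a, b in edges
    }

def top_fraction(scores, fraction=0.05):
    """Keep the highest-scoring fraction of edges (at least one edge)."""
    keep = max(1, math.ceil(fraction * len(scores)))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep]

# Toy data: A and B move together across samples, A and C do not.
expr = {"A": [1.0, 2.0, 3.0], "B": [0.0, 1.0, 2.0], "C": [3.0, 1.0, 2.0]}
scores = edge_scores([("A", "B"), ("A", "C")], expr)
kept = top_fraction(scores, fraction=0.5)
```

An edge between two genes whose expression differs by a constant offset scores 0, so the filter favors edges whose endpoints diverge across samples.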

Enumeration was done by generating up to 4-node connected subnetworks from the filtered dataset.
While a larger n is possible, the number of subnetworks grows exponentially with n, so we generated
subnetworks of up to 4 nodes only (see Section S1.2). Enumeration of all possible subnetworks was done to exhaustively
identify and rank all possible subnetworks. To filter out insignificant subnetworks, the subnetwork score
(inner-pattern consistency) of each n-node subnetwork was calculated:


$$\Delta g_{e_k} = g_{v_i} - g_{v_j}, \quad e_k = (v_i, v_j), \quad i > j \qquad (2)$$

$$S_{p_n} = \frac{\sum_{k > l} |\mathrm{cor}(\Delta g_{e_k}, \Delta g_{e_l})|}{|e|(|e| - 1)/2} \qquad (3)$$


where $g_{v_i}$ denotes the expression vector of node (gene) $v_i$ and $\Delta g_{e_k}$ denotes the edge vector of edge $e_k$. cor denotes
Pearson correlation. $|e|$ denotes the total edge count in the subnetwork. $S_{p_n}$ denotes the score for the subnetwork,
i.e. the mean of the absolute correlations over all distinct pairs of edge vectors (Figure 2).
To avoid extreme cases when only one node has a degree larger than 1, 4-node subnetworks with an average
degree less than or equal to 0.75 were discarded. A threshold of the subnetwork score was set and all subnetworks
with a score below the threshold were discarded.
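The enumeration and scoring steps can be sketched together. The enumeration below grows connected node sets one neighbor at a time and treats a subnetwork as a set of nodes; the score is the mean absolute Pearson correlation over all distinct pairs of edge vectors, following the description in Figure 2. Function names and data layout are hypothetical, and the sketch assumes every edge vector has non-zero variance.

```python
import math
from itertools import combinations

def connected_node_sets(adj, max_nodes=4):
    """Enumerate node sets of all connected subgraphs with 2..max_nodes nodes."""
    current = {frozenset([v]) for v in adj}
    found = set()
    for _ in range(max_nodes - 1):
        grown = set()
        for nodes in current:
            # Neighbors of the current set that are not yet part of it.
            frontier = set().union(*(adj[v] for v in nodes)) - nodes
            grown.update(nodes | {w} for w in frontier)
        found |= grown
        current = grown
    return found

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)  # undefined for constant vectors

def pattern_consistency(edges, expr):
    """Mean |Pearson correlation| over all distinct pairs of edge vectors."""
    deltas = [[a - b for a, b in zip(expr[u], expr[v])] for u, v in edges]
    pairs = list(combinations(deltas, 2))
    return sum(abs(pearson(x, y)) for x, y in pairs) / len(pairs)

# Enumeration on a 5-node path a-b-c-d-e.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "e"}, "e": {"d"}}
subnets = connected_node_sets(path, max_nodes=4)

# Scoring a 3-node subnetwork whose edge vectors are perfectly (anti)correlated.
expr = {"A": [1.0, 2.0, 3.0, 4.0], "B": [2.0, 4.0, 6.0, 8.0], "C": [0.0, 0.0, 0.0, 0.0]}
score = pattern_consistency([("A", "C"), ("B", "C"), ("A", "B")], expr)
```

Because the number of connected node sets grows quickly with the size limit, a sketch like this is only practical on a sparse, filtered network, which is consistent with the edge filtration step applied beforehand.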

Feature selection for the subnetwork list L was done using Algorithm 1. First, all subnetworks are ranked 
in descending order and placed in an array. While there are subnetworks in this array, the top network is 
saved as a feature and removed from the array. The feature network is then compared against the other 
subnetworks in the array. If any subnetwork has shared genes with the selected feature, it is removed from 
the array. The final set of subnetwork features is returned. 


Algorithm 1: Feature selection for n-node subnetworks 


Data: Set of subnetworks L, scoring function S
Result: Feature set F, a set of subnetworks with unique nodes
S ← rank(L, S) // rank subnetworks with score function S from Eq. 3
F ← ∅ // feature set is empty
while S ≠ ∅ do
    t ← max(S) // first subnetwork in the ranked list is t
    S ← S − t;
    foreach u ∈ S do
        // check if any nodes (genes) are shared
        if V(u) ∩ V(t) ≠ ∅ then
            S ← S − u;
        end
    end
    F ← F ∪ t // add t to feature set
end
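The greedy selection in Algorithm 1 can be sketched in a few lines. Walking the ranked list once and skipping any subnetwork that shares a node with an already selected one yields the same feature set as repeatedly removing overlapping subnetworks. The representation of a subnetwork as a frozenset of gene names and the toy scores are assumptions for this example.

```python
def select_features(subnetworks, score):
    """Greedy selection of top-scoring subnetworks with pairwise disjoint nodes."""
    ranked = sorted(subnetworks, key=score, reverse=True)
    features, used = [], set()
    for nodes in ranked:
        # Keep a subnetwork only if it shares no gene with earlier picks.
        if used.isdisjoint(nodes):
            features.append(nodes)
            used |= set(nodes)
    return features

# Toy subnetworks with made-up scores.
toy_scores = {
    frozenset("ABCD"): 0.95,
    frozenset("CDEF"): 0.90,  # overlaps the top subnetwork on C and D
    frozenset("EFGH"): 0.88,
}
features = select_features(toy_scores, toy_scores.get)
```

In the toy input the second-ranked subnetwork is dropped because it shares genes with the first, which in turn allows the third-ranked subnetwork to be selected.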


For subnetwork expansion, nodes (genes) neighboring the subnetwork (u) were added to the subnetwork 
one by one (Algorithm 2). At each iteration for each neighboring node, we test: 


$$S(u) > S_T, \quad S(u) - S(j) \geq T, \quad |E(j)| - |E(u)| > \alpha |E(u)| \qquad (4)$$

where $S(u)$ denotes the subnetwork score at the expansion step. $S_T$ denotes the minimum threshold
for subnetwork score expansion, which is set to be 0.87. $T$ is a threshold for the tolerance of score decrease.
$|E(u)|$ denotes the total number of edges in the subnetwork. $\alpha$ is a constant coefficient, where the set of
nodes in the network will not grow in size more than this ratio. $j$ is the network state assuming the node
being considered is added to the subnetwork. The purpose of these two rules is to prevent the subnetwork
from infinite expansion. If the rules are not satisfied, the expansion will stop. In this study, we set $T$ to 0.05
and $\alpha$ to 0.25. We then select the neighboring node (gene) which has the largest score and add that node to
the subnetwork. This process is repeated until no node can be added due to constraints.


Algorithm 2: The subnetwork expansion algorithm 


Data: Set of feature subnetworks F, where u ∈ F, and networks are scored by function S
S_T denotes the minimum threshold constant for subnetwork score expansion (see Section 2.2)
G is the protein-protein interaction network.
Result: Expanded subnetwork u
foreach u ∈ F do
    repeat
        foreach v' ∈ V(G), v ∈ V(u), (v, v') ∈ E(G) do
            j ← u ∪ {v'};
            if S(u) > S_T, S(u) − S(j) > T, |E(u_j)| − |E(u)| > α|E(u)| then
                if max_j < S(j) then
                    max_j ← S(j);
                    v'' ← v';
                end
            else
                break;
            end
        end
        u ← u ∪ {v''};
    until S(u) > S_T, S(u) − S(j) > T, |E(u_j)| − |E(u)| > α|E(u)|;
end


2.2 Parameter Tuning 


The aforementioned values of parameters were determined by parameter tuning. These include the edge
selection proportion (a), the low threshold of subnetwork score ($S_T$), and the number of clusters for patient
clustering ($N_C$). First, $S_T$ and $N_C$ were tuned while a was fixed to 5%. Two indicators were used to optimize
$S_T$ and $N_C$. One was the clustering stability ($C_S$), and the other one was the distance from the background
($D_B$). $C_S$ is the mean of cluster-consensus values calculated by the ConsensusClusterPlus package. $D_B$
is defined as the distance from background clustering, the clustering result generated by setting $S_T$ to 0.
Specifically, the distance is defined as:


$$D_B = 1 - \mathrm{FM}_{index}(C_{S_T}, C_0) \qquad (5)$$


where $C_{S_T}$ is the clustering labels from threshold $S_T$ and $C_0$ is the clustering labels when $S_T = 0$. The
Fowlkes-Mallows index ($\mathrm{FM}_{index}$) is a measurement of similarity between two clustering results [26]. By gradually
increasing $S_T$, for each number of clusters (k), the relationship between $S_T$ and the two indicators, $C_S$ and $D_B$,
was explored (Figure S1A and S1B). Noticeably, $D_B$ increases with $S_T$, which indicates that the feature
selection step is necessary in order to generate different clustering results from the background. For $C_S$, it
is interesting that $C_S$ reaches its maximum value when $N_C$ is 5. We then further explored the relationship
between $C_S$ and $D_B$ (Figure S1C). By considering both indicators, three $S_T$ values from $N_C = 5$ were very
prominent. Among 0.83, 0.85, and 0.87, we chose 0.87 as the final $S_T$ value since when both $D_B$ and $C_S$ are
similar, $C_S$ is a more important parameter than $D_B$.
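The distance from the background can be illustrated with a short pair-counting sketch of the Fowlkes-Mallows index. The function names are hypothetical, and returning 0 when either clustering contains no co-clustered pair is a convention chosen for this example.

```python
from collections import Counter
from math import comb, sqrt

def fowlkes_mallows(labels_a, labels_b):
    """Fowlkes-Mallows index between two labelings of the same samples."""
    # Pairs of samples placed together in both clusterings.
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together in each clustering separately.
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    if pairs_a == 0 or pairs_b == 0:
        return 0.0  # convention for degenerate clusterings (assumption)
    return both / sqrt(pairs_a * pairs_b)

def background_distance(labels, background_labels):
    """Distance from the background clustering: 1 - FM index."""
    return 1.0 - fowlkes_mallows(labels, background_labels)

same = background_distance([0, 0, 1, 1], [1, 1, 0, 0])       # same partition, renamed labels
different = background_distance([0, 0, 1, 1], [0, 1, 0, 1])  # no co-clustered pair in common
```

The index depends only on which samples are grouped together, so relabeling the clusters leaves the distance at 0, while two partitions that never co-cluster the same pair are at distance 1.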

Second, the proportion of edge selection (a) was evaluated. Due to the limitation of computation power,
5% is almost the maximum percentage of edges we can keep. We then gradually decreased a to inspect
its influence on patient clustering. By fixing $D_B$ and $C_S$ as mentioned above, FM indices between each
clustering result caused by different a values were calculated. In addition, we fixed a to 5% but sampled its
Fig. 2: Subnetwork variables and their relationships. A subnetwork consisting of 6 nodes and 8 edges.
The subnetwork state, which represents the expression pattern of this subnetwork in sample 1, is colored 
according to gene expression levels. Expression matrix of the subnetwork in (A) with 10 samples. Expression 
values are centered and scaled. Edge vector is defined as the difference between expression vectors of the 
corresponding node pair. Edge A-D is used here as an example. The edge matrix combines all edge vectors 
from the subnetwork. The edge correlation matrix is calculated from the edge matrix. The lower triangle 
(diagonal excluded) of the matrix is used to calculate the Pattern Consistency score which is defined as 
the mean of the absolute values of the correlations. The subnetwork vector is defined as the first principal 
component of the expression matrix. It is used as the summary of the patterns of this subnetwork across all 
samples. It is also used to cluster samples in the following steps. 


subnetwork features (using 80% of all the features each time) to evaluate the error of clustering caused by 
random sampling (Figure S1D). It was interesting that the clustering difference caused by PE was even less 
than the clustering difference caused by 80% random sampling. Based on these results, a did not have a 
significant impact on patient clustering. Therefore, in this project, a was set to 5% since including more edges 
would produce more subnetwork features and therefore provide a better view of the underlying biological 
background. 


2.3 Clustering of LGG patients and subnetworks 


The subnetwork vector was calculated by the prcomp function from the R package stats. Consensus clustering of
patients and subnetworks was done with the R package ConsensusClusterPlus. Clustering stability was defined as the
mean of cluster-consensus values. Fowlkes-Mallows index was used to measure the distance of current clus- 
tering from the background. Consensus clustering of patients and subnetworks was done for 10,000 iterations 
with sampling proportion set to 0.75 and hierarchical clustering (Ward’s method). The self-organizing map 
was done using R som. 
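As an illustration of the Fowlkes-Mallows index mentioned above, the following sketch counts sample pairs that are grouped together in one or both clusterings. It is a generic textbook formulation, not the authors' code; how the index is converted into a "distance from the background" (for example, one minus the index) is an assumption that the text does not specify.

```python
from itertools import combinations
from math import sqrt


def fowlkes_mallows(labels_a, labels_b):
    """Fowlkes-Mallows index between two clusterings of the same samples.

    tp: pairs placed together in both clusterings
    fp: pairs placed together only in clustering A
    fn: pairs placed together only in clustering B
    """
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            tp += 1
        elif same_a:
            fp += 1
        elif same_b:
            fn += 1
    if tp == 0:
        return 0.0
    return tp / sqrt((tp + fp) * (tp + fn))


# Identical partitions give 1.0; partitions sharing no co-clustered pair give 0.0.
identical = fowlkes_mallows([0, 0, 1, 1], [1, 1, 0, 0])
disjoint = fowlkes_mallows([0, 0, 1, 1], [0, 1, 0, 1])
partial = fowlkes_mallows([0, 0, 1, 1], [0, 0, 0, 1])
```

The index depends only on which samples are grouped together, so relabeling the clusters leaves it unchanged.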


2.4 Clinical analysis and tree models 


Survival differences (including p-values) were calculated with the survdiff function from the R package survival.
Distances between patient groups and previous subtypes were defined as the mean Euclidean distance over all
possible patient pairs from the two clusters. Correlation between subnetwork cluster vectors and telomere
length or Karnofsky score was calculated with the cor.test function using Spearman’s method and exact set to
false. GO terms (biological process) of subnetwork groups were annotated with the enrichGO function from the R
package clusterProfiler. Mutation fold change was defined as the actual mutation count divided by the
expected count.
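Two of the quantities defined above are simple enough to sketch directly. The example below assumes each patient is represented by a numeric feature vector; the function names are hypothetical, and how the expected mutation count is obtained is not specified in the text, so it is passed in as a plain number.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def mean_pairwise_distance(group_a, group_b):
    """Mean Euclidean distance over all patient pairs drawn from two groups."""
    total = sum(dist(a, b) for a in group_a for b in group_b)
    return total / (len(group_a) * len(group_b))


def mutation_fold_change(observed_count, expected_count):
    """Observed mutation count divided by the expected count."""
    return observed_count / expected_count


# Toy example: one patient in group A, two patients in group B.
d = mean_pairwise_distance([(0.0, 0.0)], [(3.0, 4.0), (0.0, 0.0)])
fc = mutation_fold_change(12, 4)
```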

Tree models were trained with the rpart function from the R package rpart. For binary classification of LG3,
the parameter minbucket was set to 10 and the parameter maxdepth was set to 2. For multi-label classification,
minbucket was set to 22 to simplify the model and maxdepth was left at its default (30).

The random forest model was trained with TCGA data using the subset function in R. The training process
used 1000 trees and tried 8 variables at each split, with predictor importance set to true.

Oncogenes and driver genes within each group were identified according to CCGD [27] and UniProt [28]
(Supplementary Table S4). Each subnetwork group was annotated by its corresponding activated oncogenes
as well as the signs of the subnetwork vectors.


2.5 Comparison with existing methods 


Clustering without gene selection, as well as nearest shrunken centroid-based gene selection [29] followed by
network integration, was used for comparison with the nSEA approach. First, using consensus clustering,
hierarchical clustering, principal component analysis, and k-means clustering, we grouped patients and
investigated the patient groups by running survival analysis and examining clinical variables. Second, we
trained a nearest shrunken centroid classifier. This widely used approach [30-33] identifies genes
that stratify LGG samples. Subsequently, a protein-protein interaction (PPI) subnetwork was generated by
overlaying the gene expression profiles with a network downloaded from STRING (Section 2.6), followed by
node pruning and edge filtration. Networks were scored similarly to the nSEA approach, as described in Section 2.1.
PCA scores were subjected to various clustering techniques, including consensus clustering, k-means
clustering, hierarchical clustering, and PCA, to classify individuals into multiple distinct classes. Kaplan-Meier
plots were generated from the clustering results.


2.6 Data preparation 


Gene expression data were downloaded from previously published studies by TCGA [34] and CGGA [35-37].
The TCGA datasets were generated on the Illumina HiSeq 2000 platform. The level-3 expression data were
obtained from the UCSC Xena Portal [38]. Non-tumor samples were removed, leaving data
for 516 patients. The gene expression matrix was already log-transformed. Genes were normalized using z-score
normalization across all patients. Outliers were identified with adjboxStats from the robustbase R package. The
CGGA datasets were generated on the Illumina HiSeq platform. The raw gene counts were downloaded from
the CGGA portal from the 'mRNAseq-693' dataset. CGGA data were log-transformed and normalized in the same way as
the TCGA dataset. PPI data were downloaded from the STRING PPI database [39]. The PPI network was filtered by
eliminating edges with a combined evidence score of less than 0.7. The resulting PPI network had
13,562 nodes and 277,172 edges.
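The two preprocessing steps above (per-gene z-score normalization and filtering PPI edges by combined score) can be sketched as follows. This is an illustrative outline, not the authors' pipeline: it assumes the sample standard deviation is used for scaling and that combined scores are already expressed on a 0 to 1 scale (STRING download files typically store them as integers out of 1000, which would need to be divided by 1000 first). The function names are hypothetical.

```python
from statistics import mean, stdev


def zscore(values):
    """Center and scale one gene's expression values across patients.

    Uses the sample standard deviation (n - 1 denominator); this choice is
    an assumption, since the text does not state which estimator was used.
    """
    mu = mean(values)
    sd = stdev(values)
    return [(v - mu) / sd for v in values]


def filter_ppi(edges, min_score=0.7):
    """Keep PPI edges whose combined evidence score is at least min_score.

    Each edge is a (protein_a, protein_b, combined_score) tuple with the
    score on a 0 to 1 scale.
    """
    return [(a, b, s) for a, b, s in edges if s >= min_score]


normalized = zscore([1, 2, 3])
kept = filter_ppi([("TP53", "MDM2", 0.99), ("GENE1", "GENE2", 0.45)])
```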


3 Results 


3.1 Subnetworks Classify LGG Samples into 5 Groups 


We employed the n-Node Subnetwork Enumeration Algorithm (nSEA) to analyze LGG gene expression
profiles [40], comprising 516 patients categorized as astrocytoma (33%), oligodendroglioma (34%), and
oligoastrocytoma (22%). A protein-protein interaction (PPI) network was derived from the STRING database using
a threshold of combined evidence score set to 0.7 [39], resulting in an undirected PPI network with 13,562
nodes and 277,172 edges (Figure 1E). A sparse network was constructed by retaining the top 5% of edges based
on edge vector deviation (Figure 1A; Figure 2C), yielding 5,681 nodes and 13,643 edges. The subnetwork size
(n) was set to 4 to balance robustness and computational efficiency, generating a total of 25,413,392
4-node subnetworks through subnetwork enumeration.
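To make the idea of enumerating connected n-node subnetworks concrete, here is a small generic sketch that grows connected node sets one neighbor at a time. It is not the nSEA implementation, whose details are not given in this section, and it is only practical for small graphs because it keeps every intermediate node set in memory. The adjacency dictionary and function name are hypothetical.

```python
def connected_subnetworks(adjacency, n):
    """Enumerate all connected node sets of size n in an undirected graph.

    adjacency maps each node to the set of its neighbors (symmetric).
    Every connected set of size k + 1 can be reached by adding one neighbor
    to some connected set of size k, so growing level by level is exhaustive.
    Using frozensets removes duplicates reached through different growth orders.
    """
    current = {frozenset([v]) for v in adjacency}
    for _ in range(n - 1):
        grown = set()
        for sub in current:
            for v in sub:
                for w in adjacency[v]:
                    if w not in sub:
                        grown.add(sub | {w})
        current = grown
    return current


# Toy path graph A - B - C - D.
graph = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
triples = connected_subnetworks(graph, 3)
quads = connected_subnetworks(graph, 4)
```

On the path graph, the 3-node results are {A, B, C} and {B, C, D}, and the only 4-node result is the whole path.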

We investigated diverse properties of subnetwork feature sets to determine the optimal threshold for 
inner-pattern consistency in subnetwork selection. Decreasing the threshold led to an incremental rise in 
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Fig. 3: Patient Groups and Subnetwork Clusters (A) Distance from background versus clustering sta- 
bility from different inner-pattern consistency thresholds. 0.87 is highlighted in red. (B) Self-organizing map 
with 100 units. Patients were mapped to the units, with different shapes representing different patient groups. 
Units were also annotated with groups by majority voting. (C) Heatmap of subnetwork versus patients. LGG 
patients were clustered into 5 groups (LG1~5) by consensus clustering using Euclidean distance. Subnet- 
works were clustered into 8 clusters by consensus clustering using absolute Pearson correlation distance. The 
sign of each subnetwork vector was adjusted to positively correlate with selected oncogenes or driver genes. 


subnetwork inclusion in each feature set until saturation (Figure 3A). Clustering, based on subnetwork state 
matrices formed from the first principal component of subnetwork expression (Figure 2F), was then assessed 
for stability across thresholds. Interestingly, clustering stability peaked at both ends of the threshold curve 
for cluster numbers between 4 and 7 (Figure S1B), indicating distinct clustering patterns between high and 
low-threshold feature sets. Employing stability curves, we selected 5 clusters based on the relative change of 
cumulative distribution function (CDF) area (Figure S2E) [41].

Upon fixing the cluster number at 5, we applied the selection algorithm without a threshold to create a
background for comparison against feature-based clustering (Figure S2C). The transition from background to
high-threshold clustering was evident from a sharp increase around threshold 0.8. Examining the relationship
between clustering stability and distance from the background revealed optimal thresholds (0.80 to 0.87) with
high stability and separation (Figure 3A). Opting for 0.87 over 0.83 and 0.85, we selected a threshold
conducive to subsequent steps.

Fig. 4: The Kaplan-Meier plot shows the survival analysis for the TCGA patient groups based on TCGA prognostic
networks (x-axis: time in months). The p-value < 4.1e-15 shows that the groups have distinct survival patterns.

Patient samples were clustered based on subnetwork state matrices derived from a final feature set of 92
subnetworks. Subnetwork sizes ranged from 6 to 11 nodes, predominantly comprising 6-node subnetworks (57%).
Consensus clustering with Ward’s method (10,000
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Fig. 5: Characterization of Patient Groups (A) Comparison of patient groups with current subtypes 
and clusters. (B) Relationship between patient groups and significant gene mutations. (C) Methylation of 
MGMT promoter and mutation of TERT promoter ordered by patient groups. 


iterations) generated a heatmap ordered by clustering dendrogram, revealing 5 patient groups exhibiting
distinct subnetwork state patterns (Figure 3). Validation of the consensus clustering approach using an
unsupervised self-organizing map affirmed unbiased clustering (Figure 3B).

To annotate subnetworks, we performed consensus clustering on subnetwork vectors, identifying 8
subnetwork groups (SNG1~8). Genes within each group were divided into 2 clusters by correlation. Notably,
SNG3 and SNG4 were enriched in cancer driver genes, with SNG4 housing 4 oncogenes associated with the
p53 pathway. Protein class and biological process analyses further revealed significant associations with
specific subnetwork groups, illuminating potential biological implications (Supplementary Tables S2-S3).

Additionally, we explored the correlation between subnetwork vectors and clinical attributes such as Karnofsky
performance score and telomere length (Supplementary Table S6). Remarkably, SNG5 and SNG8 were
significantly correlated with Karnofsky scores (p-value < 8.5e-06 and p-value < 5.0e-03, respectively).
Further, gene cluster 2 of SNG5 contained driver genes linked to mental illnesses (Supplementary Table S7).
Telomere length showed significant association with SNG3, SNG6, and SNG8 (p-value < 0.021),
reinforcing links between chromatin remodeling and telomere regulation. Notably, NIPBL and KALRN emerged as
promising gene candidates correlated with distinct patient subgroups, emphasizing their potential roles in
promoter regulation and neuropathological disorders.


3.2 LG3: A Previously Unidentified Patient Group with Distinct Features 


A comparison of our patient groups with TCGA subtypes and clusters demonstrated the alignment of LG1, LG2,
and LG4 with known LGG subtypes. However, LG3 defied such classification, signifying a novel patient group
unnoticed in prior TCGA studies (Table S5). Intriguingly, LG3 exhibited a unique clinical profile and
subnetwork state pattern.

LG4 exhibited the highest proportion of grade-3 tumors and the oldest mean age (Figure S3A-B), 
accompanied by the worst Karnofsky performance score (Table S6). LG2 included relatively younger patients 
compared to LG1, LG3, and LG5. Telomere length analysis showcased pronounced shortening in LG4, 
consistent with previous research (Figure S3C) [42]. Notably, LG3 displayed a distinct advantage with the 
highest proportion of patients exhibiting high Karnofsky scores (> 90). 
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Survival analysis further underscored the significance of LG3, presenting improved survival compared to 
other groups, including LG1, LG2, and LG4, which mirrored the IDHmut-codel, IDHmut-non-codel, and IDHwt
subtypes (Figure 4). Decision tree modeling unveiled key subnetworks (SNG4 and SNG5) driving LG3’s
unique clinical outcome (Figure S4).

Mutation analysis elucidated distinct genomic characteristics of LG3, marked by a scarcity of EGFR,
NF1, and PTEN mutations, which could potentially contribute to its favorable prognosis. Additionally,
supervised learning revealed methylation of NIPBL and KALRN as distinguishing features of LG3, offering
novel insights into regulatory mechanisms and neuropathological associations.


3.3 Comparison with existing methods 


First, we employed k-means clustering, hierarchical clustering, principal component analysis, and consensus
clustering to determine disease subtypes based on mRNA gene expression profiles alone. While the groups
had significant survival differences, the clusters did not follow any particular pattern, and the number of genes
was too high to discover any particular pattern from these analyses (Figure S5).

We also compared our method to sample classification from gene expression data by the method of 
nearest shrunken centroids [29]. We were able to stratify the samples into four distinct classes by utilizing 
sample differences based on correlation analysis. This classification informed the selection of an optimal 
gene inclusion threshold through a rigorous cross-validation procedure (PAMR package in R). Subsequently, 
we refined our original genomic matrix to incorporate only these curated genes. A tailored Protein-Protein 
Interaction (PPI) subnetwork was generated. This started with integrating the genomic expression matrix 
with the PPI network, followed by node pruning and edge filtration. High-correlation edges were selected using 
a stringent threshold to create subnetworks, revealing gene pairs with potential interconnected functionalities. 
While consensus-based clustering for both the PAMR-refined matrix and the PPI subnetwork yielded Kaplan-Meier
plots with statistically significant survival differences (Figure S6), the clusters had no discernible
feature to study (Figure S7).




3.4 Validation of LGG Patient Groups


To ascertain the robustness of our patient groups, we validated our findings using an independent dataset,
CGGA693. Through this validation, we verified the consistent clustering of LGG patients into LG1-5,
confirming the existence and preservation of distinct subnetwork-based patient groups across different
datasets and platforms. Further survival analysis validated the prognostic significance of these patient
groups (Figure 6).

The subnetwork feature vectors from the TCGA dataset retained their ability to characterize the CGGA693
dataset (Figure 7), solidifying the robustness and generalizability of our approach. The relationship between
TCGA groups (LG1-5) and CGGA groups further confirmed the concordance between these datasets. Importantly,
the conserved survival traits of LG1-5 across datasets validated the clinical relevance of our patient groups,
offering a promising avenue for refined LGG prognosis and treatment strategies.

Fig. 6: The Kaplan-Meier plot shows the survival analysis for the CGGA patient groups based on the TCGA
prognostic network. The p-value < 1.8e-07 shows that the groups have distinct survival patterns in this
secondary data as well.
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4 Discussion 


Many researchers have proposed subtypes of LGG over the last decade. Classification based on genetic 
features rather than histological features has been demonstrated to be more biologically relevant. The most 
widely accepted classification is based on molecular subtypes, which classify LGG patients into three clusters 
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Fig. 7: The CGGA patient groups are based on the random forest model trained by the 92 prognostic networks 
of TCGA data. The 257 lower-grade glioma patient samples (filtered by WHO grade) were clustered into 
5 groups (group 1~5) by consensus clustering using Euclidean distance and the same 92 network measures 
calculated from the expression data provided. The barplot shows clinical features reported by CGGA [35-37]. 
Note that IDH1 wildtype group is identified as LG4 in this unsupervised approach once more. 


based on IDH mutation and chromosome 1p/19q co-deletion. However, recent studies have challenged this 
classification by suggesting that TERT may play an important role in glioma development. Despite the 
increasing specificity of LGG classification, the underlying mechanisms of these biomarkers remain unclear. 
For instance, patients with IDH wildtype genotype experience the worst survival outcomes. However, if 
they have both TERT and IDH mutations, their survival length is significantly extended, forming the best 
survival group. This suggests the existence of synergistic relationships among driver genes in LGG. 

In this context, our algorithm, nSEA, offers insight into characterizing these tumors by
capturing dysregulation within pathways. Unlike common bioinformatics approaches that focus on mutations, 
methylation, and copy-number variation, our approach employs a different methodology. By scanning
more than 25 million 4-node subnetworks, we provide a comprehensive view of subnetwork states within
LGG. Through feature selection based on clustering statistics, we identify 92 subnetworks that categorize 
LGG patients into 5 groups. Three of these groups can be mapped to the general subtypes, demonstrating 
the ability of our algorithm to capture biologically significant signals. Additionally, we uncover one patient 
group, LG3, which not only exhibits distinct subnetwork states but also holds clinical significance. We further 
validate these patient subtype groups using a second cohort, showing that survival traits are conserved even 
across different patient populations. 

Further analysis reveals that, compared to other groups, LG3 demonstrates the best survival and Karnofsky
performance score. The decision tree model trained on LG3 suggests that SNG4 and SNG5, enriched
with oncogenes and associated with mental disorders respectively, can distinguish LG3 from other
patients with high accuracy. Mutation analysis indicates that LG3’s improved clinical performance may be
attributed to the absence of mutations in EGFR, NF1, and PTEN. Moreover, a tree model based on
methylation data highlights NIPBL and KALRN as the two genes responsible for the primary and secondary splits of
the tree, respectively. Apart from their roles in transcription regulation through promoters, NIPBL has been
linked to various types of cancers [43], suggesting its potential importance in gliomagenesis. The protein 
encoded by KALRN, Kalirin, belongs to the RhoGEF protein family, several members of which have been 
identified as cancer driver genes [44]. The Dbl-homologous domain of this protein could potentially become 
a target for future drug development [45]. 

The unsupervised nSEA approach also identified high percentages of cancer driver genes in each
subnetwork group. These networks underscore the biological significance of the subnetworks captured by nSEA.
The synergistic nature of driver genes has been extensively studied in the past, and nSEA networks provide
insights into how driver genes synergistically contribute to tumor progression. Our findings offer valuable
insights based on correlation analysis. However, it is imperative to establish causative relationships in order
to gain a deeper understanding of each subtype. Driver mutations and epigenetic events warrant further
investigation to delineate these causative relationships. While our approach involved feature selection to
categorize patients into groups, numerous driver genes that could differentiate patient groups were identified.
Any drivers not included could be further explored using nSEA networks to better understand their roles in
gliomagenesis.
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The availability of multiple publicly-available datasets studying the same phenomenon has the 
promise of accelerating scientific discovery. Meta-analysis can address issues of reproducibility 
and often increase power. The promise of meta-analysis is especially germane to rarer diseases like 
cystic fibrosis (CF), which affects roughly 100,000 people worldwide. A recent search of the 
National Institutes of Health’s Gene Expression Omnibus revealed 1.3 million data sets related to
cancer compared to about 2,000 related to CF. These studies are highly diverse, involving different 
tissues, animal models, treatments, and clinical covariates. In our search for gene expression 
studies of primary human airway epithelial cells, we identified three studies with compatible 
methodologies and sufficient metadata: GSE139078, Sala Study, and PRJEB9292. Even so, 
experimental designs were not identical, and we identified significant batch effects that would have 
complicated functional analysis. Here we present quantile discretization and Bayesian network 
construction using the Hill climb method as a powerful tool to overcome experimental differences 
and reveal biologically relevant responses to the CF genotype itself, exposure to virus, bacteria, and 
drugs used to treat CF. Functional patterns revealed by clusterProfiler included interferon
signaling, interferon gamma signaling, interleukins 4 and 13 signaling, interleukin 6 signaling, 
interleukin 21 signaling, and inactivation of CSF3/G-CSF signaling pathways showing significant 
alterations. These pathways were consistently associated with higher gene expression in CF 
epithelial cells compared to non-CF cells, suggesting that targeting these pathways could improve 
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clinical outcomes. The success of quantile discretization and Bayesian network analysis in the 
context of CF suggests that these approaches might be applicable to other contexts where exactly 
comparable data sets are hard to find. 


Keywords: Cystic Fibrosis, Bayesian Network, Data. 
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1. Introduction 


Worldwide initiatives are currently discussing the principles of acquiring, standardizing, 
storing, and making scientifically produced data accessible for reuse. However, one of the key
difficulties is addressing heterogeneity in the data that arises from batch effects. These
batch effects occur when we compare multiple datasets obtained from different laboratories or
platforms, or processed at different time points. These internal differences can lead to
misinterpretations of the results, and they are a common issue not only in omics data analysis but also in
many cross-study comparisons.!4 In recent years, there has been increasing consideration of
batch effects in data analysis and several approaches have been proposed to address them. The 
simplest way to handle batch effects is to include them in the statistical model during analysis. 
Other approaches involve estimating and creating a new dataset adjusted by batch effects, to 
perform the statistical analyses. However, it is important to note that this technique can reduce
statistical power, particularly when the batch-group design is unbalanced, meaning that batch
differences may be influenced by group differences. This correction can either diminish group 
differences or introduce new batch effects due to errors in batch effect estimation that may be 
inflated by false positives.> 


Cystic fibrosis (CF) is a recessive genetic disorder characterized by alterations in 
electrolyte transport across polarized epithelia resulting from mutations in the CF 
transmembrane conductance regulator gene (CFTR). Numerous studies on CF have identified 
similarities or specific gene signatures that are closely related.48 However, the number of
available transcriptomic datasets for reanalysis and comparison is continually growing.2
Integrating data from diverse sources can provide a more comprehensive understanding of
underlying biological processes that may not be evident from individual studies alone,
especially when dealing with multiple conditions and distinct variables.12 Meta-analysis
of individual microarray studies on CF can help assess the connections between
respiratory disorders at the transcriptomic level and provide insights for pathway analysis, but
it must deal with several conditions, such as antibiotic usage, mutation type, and infection by viruses or
bacteria. 12


Meta-analysis is a statistical tool that allows the analysis of results from different scientific
studies conducted in different locations or using different methods." In the late 1990s,
network meta-analysis (NMA), also known as multiple-treatments or mixed-treatment
comparison meta-analysis, was introduced as an extension to standard meta-analysis. NMA
can compare multiple treatments simultaneously, even when direct comparisons are lacking in
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existing studies.2 One systematic review of NMA methods found that around two-thirds of 
NMA studies utilized a Bayesian approach.“ Bayesian network (BN) models are
promising in the medical field because they represent real-world relationships between variables,
making them more contextually meaningful than purely numeric associations.4
They have been used in various areas of medical science and can include different types of
variables, such as clinical, diagnostic, prognostic, and symptom variables. This versatility allows
researchers to integrate prior beliefs with sample data, and BN analysis has recently been
utilized in epidemiology, public health, and medicine.43@ On the other hand, there is limited
knowledge about BN meta-analysis, which may be attributed to researchers' lack of
understanding of or familiarity with Bayesian methods. Nevertheless, there is significant potential
for the application of BN meta-analysis in medicine. +


Standard meta-analysis only allows for comparing two interventions at a time, whereas BN
meta-analysis enables the inclusion of evidence from both direct and indirect comparisons in a
single analysis.12 However, BN analysis interpretations still require specific assumptions for
accurate algorithm learning and interpretation of network structure, making it a
challenging task. To address these issues inherent in meta-analysis, our study proposes a
novel approach to pairing multiple transcriptomic datasets by quantile discretization and
integrating metadata variables in a new BN Meta Transcriptomic analysis. This approach aims
to provide new and valuable insights into understanding the complexities of a multifactorial
disease like CF.


2. Methods 


2.1. Data Selection 


We accessed datasets available in the Gene Expression Omnibus (GEO) database
(https://www.ncbi.nlm.nih.gov/geo/) by searching the keyword "cystic fibrosis". A total of 17
datasets were returned by this query, which was performed in November 2022. Nine datasets
were excluded from further analysis due to methodological incompatibility or insufficient
metadata: they used different cell tissues or experimental designs, or did not
measure the same patient variables. Three of the remaining studies measured gene expression in
airway epithelial cells, and we retrieved metadata for these three studies. The first dataset
(PRJEB9292), published by Balloy et al.,/8 included both non-CF and CF epithelial cells
infected with Pseudomonas aeruginosa at different time points. The second dataset
(GSE139078) £ involved epithelial cells from CF patients, infected with Rhinovirus or left as controls,
and treated with Ivacaftor or Lumacaftor/Ivacaftor, modulator drugs used to enhance the
function of CFTR. The third study% included two datasets: a pilot dataset with 13 samples
and a validation dataset containing 35 samples. All datasets provided patient genotype,
modulator information, and infection status with either Pseudomonas aeruginosa or
Rhinovirus.


2.2. Data Harmonization and Analysis 


The metadata description included means and standard deviations for numeric variables and
frequencies and percentages for categorical data. RNAseq datasets were individually
normalized by library size and log CPM (counts per million) transformation, and differential
expression analyses were performed individually for each dataset using DESeq2.*! In the Balloy
dataset, we compared CF vs. non-CF cells, infected or not infected with Pseudomonas aeruginosa; in
the De Jong dataset, CF epithelial cells infected or not infected with virus; and in the
Salas dataset, epithelial cells of CF patients compared to non-CF subjects. In this exploratory
design, the differentially expressed genes (DEGs) were used to filter the large number of targets,
and they were determined by applying specific criteria: genes with a P-value less than 0.05 and a
log2 expression fold change greater than 1 or less than -1 were considered differentially expressed.
These criteria were chosen to serve as a filter and help reduce processing time. Each study was normalized
individually, and each gene was discretized according to sample distributions. The count table
with filtered genes was discretized into quartiles (1st: minimum to 25%, 2nd: 25% to 50%,
3rd: 50% to 75%, and 4th: 75% to maximum, by sample distribution) using
Hartemink's algorithm, which is available in the bnlearn package.22#2 Afterward, all the
transformed transcriptomic datasets were merged into a single discretized dataset, to which
metadata was added. The learning algorithm used to establish the Bayesian network structure
was based on the heuristic hill-climbing method.242° Bayesian network learning was used to
visualize conditional dependencies between multiple clinical and transcriptome variables.
The dependencies are represented qualitatively by a directed acyclic graph in which each node
corresponds to a variable and a directed arc between nodes represents a direct influence.
Robustness of the arcs was scored by a non-parametric bootstrap test (100 replicates).27 For
functional analysis of genes related to CF, virus infection, bacterial infection, and use of
modulators, pathway enrichment analysis was performed using the clusterProfiler package and
Reactome gene sets.7822 For the pathway meta-analysis we used the qusage package.*2 All
analyses were performed in R version 4.0.2% and the Bayesian network and discretization
scripts are available on GitHub (https://github.com/FfKB/BNCF). Figure 1 presents a summary
of the study selection process and experimental design.
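The filtering and discretization steps described above can be illustrated with a short sketch. It assumes plain quartile binning with inclusive quantile interpolation, which is a simplification: it is not Hartemink's algorithm as implemented in bnlearn, and it is not the code from the authors' repository. The function names and thresholds as keyword arguments are hypothetical.

```python
from statistics import quantiles


def is_deg(p_value, log2_fold_change, alpha=0.05, min_abs_lfc=1.0):
    """DEG filter from the text: P < 0.05 and log2 fold change > 1 or < -1."""
    return p_value < alpha and abs(log2_fold_change) > min_abs_lfc


def quartile_bins(values):
    """Map each expression value of one gene to its quartile (1 to 4).

    Cut points are the gene's own quartiles within a single dataset, so the
    resulting levels are relative ranks rather than absolute intensities.
    """
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")

    def to_bin(x):
        if x <= q1:
            return 1
        if x <= q2:
            return 2
        if x <= q3:
            return 3
        return 4

    return [to_bin(x) for x in values]


bins = quartile_bins([1, 2, 3, 4, 5, 6, 7, 8])
keep_gene = is_deg(p_value=0.01, log2_fold_change=-1.8)
```

Because each gene is binned against its own within-study distribution, two studies measured on different scales yield comparable categorical levels, which is the property that allows the discretized tables to be merged.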
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Figure 1. Experimental Design. A) Diagram illustrating the study selection process using a 
Sankey diagram. The excluded datasets are highlighted in red, while the eligible datasets are 
highlighted in green. B) Flowchart depicting the data processing steps in the study. 


3. Results 


3.1. Study descriptions 


A total of three studies comprising four datasets were considered for analysis: GSE139078, 
Sala Study, and PRJEB9292. The GSE139078 dataset consists of CF patients who were 
infected with rhinovirus (RHV). The PRJEB9292 dataset includes four patients divided into 
four time points, enabling a comparison between gene expression in non CF subjects and CF 
patients infected with Pseudomonas aeruginosa. The Sala study included two datasets: the 
pilot study and the validation study, which involved a comparison of gene expression profiles 
between CF patients and non CF subjects. The analysis also includes the assessment of
modulator use (Lumacaftor and Ivacaftor; and Ivacaftor alone) in three datasets (GSE139078,
Sala Pilot, and Sala Validation). All CF patients included in these studies have the
F508del/F508del genotype, a common genetic mutation (~50%) associated with CF. However,
sex and age data were not available for all the datasets; thus, that metadata was not included in
the Bayesian Network Analysis. These carefully selected datasets provide comprehensive 
insights into gene expression patterns related to CF, considering factors such as viral and 
bacterial infections and the influence of modulators (Table 1). 


Table 1. The characteristics of subjects from the selected datasets.

Characteristic | GSE139078 | Sala Pilot | Sala Validation | PRJEB9292
Male sex, n(%) | 48 (84.2) | - | - |
Age, mean (SD) | 3.4 (1.4) | 35.3 (5.3) | 34.1 (8.2) | -
Infection by virus, n(%) | 38 (66.7) | - | - | -
Infection by P. aeruginosa, n(%) | t | t | t | a (100)
Cystic Fibrosis, n(%) | 57 (100) | 7 (53.8) | 24 (68.6) | 16 (50)*
Modulators (Luma/Iva), n(%) | 10 (17.5) | 2 (15.4) | 10 (28.6) | -
Modulators (Ivacaftor), n(%) | 9 (15.8) | 0 (0) | 2 (5.7) | -
Genotypes F508del, n(%) | 57 (100) | 7 (53.8) | 24 (68.6) | 16 (50)

* = 4 patients at 4 different time points (0, 2, 4 and 6).


3.2. Filtering gene expression data for use in the model 


We began by selecting significant genes through a conventional RNAseq comparison 
within each dataset. In the Sala Pilot and Validation studies, we compared patients with CF 
against non-CF individuals to identify genes associated with CF in these datasets. The De Jong 
datasets exclusively included CF samples, so we compared the presence or absence of virus 
infection. Lastly, the Balloy dataset consisted of different time points of infection by 
Pseudomonas aeruginosa, with an uninfected control established as point zero for comparison. 
In all of the studies, we observed changes in gene expression across the various comparisons, such 
as CF versus non-CF, presence or absence of virus, and infection by Pseudomonas aeruginosa. 
These changes indicated which genes should be integrated into our Bayesian Network Model. In 
the De Jong study, we identified 280 genes (220 up-regulated and 60 down-regulated) (Figure 
2A). In the Balloy study, we identified 350 genes (221 up-regulated and 129 down-regulated) 
(Figure 2B). In the Sala pilot study, we identified 789 genes (639 up-regulated and 150 
down-regulated) (Figure 2C), and in the Sala validation study, we identified 2716 genes (2114 
up-regulated and 602 down-regulated) (Figure 2D). The differences between all the comparisons 
can be assessed for both up-regulated genes (Figure 2E) and down-regulated genes (Figure 
2F). 


Figure 2. Differential gene expression analysis of epithelial cell datasets. A) GSE139078 
shows gene expression changes in cystic fibrosis (CF) patients infected with a virus compared 
to non-infected CF patients. B) PRJEB9292 compares gene expression in CF patients and 
controls infected with Pseudomonas aeruginosa (P.a). C and D) The Sala Cohort dataset 
compares gene expression between CF patients and non-CF subjects in a pilot study (C) and 
validation study (D). Red dots represent significant genes with fold changes above +0.5 or 
below -0.5, blue dots represent significant genes without fold change variation, and green dots 
represent non-significant genes with fold change variation. E and F) Venn diagrams represent 
the overlap and exclusivity of differentially expressed genes (DEGs) in each comparison, using 
the upregulated (>1 fold change and p-value <0.05) and downregulated (<-1 fold change and 
p-value <0.05) DEGs. 


3.3. The Bayesian network is capable of identifying genes associated with all conditions and 
covariates. 


To circumvent experimental design limitations and to measure the relationship between all 
conditions and covariates present, we discretized the log CPM table and retrieved all the 
significant genes obtained from all comparisons of each dataset, combined with its respective 
metadata (infection type (viral or bacterial), CF, modulators (Luma/Iva or Ivacaftor), and 
genotype (F508del or non-CF controls)), to create a new dataset. In total, we included 1976 
genes in the Bayesian network model. As a result, the Bayesian network reveals which genes 
have a direct relationship with the presence of bacteria, virus, usage of modulators, CF, and the 
genotype (F508del). Each condition has its own network community despite the genotype, and 
it is associated with the presence of CF (Figure 3A). Genes present in each network community 
were used for functional analysis. The functional analysis revealed Interferon signaling 
(alpha/beta and gamma) associated with the CF, virus, and bacterial network communities. 
However, IL-9, IL-21, and IL-6 signaling were exclusively related to CF. Virus exposure was 
exclusively associated with the TGF-beta pathway, and bacterial exposure did not have any 
exclusive pathway. Modulator treatment was associated with the response of EIF2AK1 to 
heme deficiency, late endosomal microautophagy, and IL-1 signaling (Figure 3B). 


Figure 3. Bayesian Network signatures associated with cystic fibrosis (CF), infection, and 
modulators. Associations were extracted using Bayesian Network analysis and reconstructed 
using the "igraph" package in R. A) The main variables (CF, mutations, modulators, and 
infection) are represented by red nodes and clusters are depicted with red dotted lines. B) 
Genes present in each cluster were used for over-represented pathway analysis. 


3.4. The CF Bayesian signature pathway is consistent across all datasets and shows higher 
expression levels when compared to non-CF epithelial cells. 


The CF-related pathways discovered in the Bayesian Network Analysis were 
subjected to qusage pathway meta-analysis to measure their activation levels in each study 
individually, as well as in combination across all studies. As a result, the Interferon 
signaling, Interferon gamma signaling, Interleukin 4 and 13 signaling, Interleukin 6 signaling, 
Interleukin 21 signaling, and Inactivation of CSF3 (G-CSF) signaling pathways exhibited an 
overall alteration across all studies with significant p-values, while the pathways Interleukin 9 
signaling and Signaling by TGF-beta receptor complex were not significant (Figure 4). We 
investigated the gene composition of the significant pathways in CF and non-CF samples to 
understand their expression (Figure 5): A) Interferon signaling, B) Interferon gamma 
signaling, C) Interleukin 4 and Interleukin 13 signaling, D) Interleukin 6 signaling, 
E) Inactivation of CSF3 (G-CSF) signaling, and F) Interleukin 21 signaling. The analysis 
revealed that a considerable proportion of epithelial cells derived from CF patients displayed 
heightened expression of these genes, falling in the upper quartile (>75%), in comparison with 
non-CF cells. These genes were poorly expressed in all 
samples in the quantile-transformed integrated dataset (Figure 5). 


Figure 4. Meta-analysis of pathway enrichment across datasets. The accumulated pathway 
analysis between all studies was conducted using the pipeline available in the qusage package. 
Dotted lines separate studies by color: red for GSE139078, blue for Sala pilot study, purple for 
Sala validation study, and green for PRJEB9292. Pathways identified as increased in relation to 
cystic fibrosis (CF) include: A) Interferon signaling, B) Interferon gamma 
signaling, C) Interleukin 4 signaling, D) Interleukin 6 signaling, E) Interleukin 9 signaling, F) 
Interleukin 21 signaling. Pathways decreased in CF include: G) Inactivation of CSF3 (G-CSF) 
signaling, and H) Signaling by TGF-beta receptor complex. 


Figure 5. Quantile distribution of expressed genes in each significant pathway related to cystic 
fibrosis (CF). A) Interferon signaling, B) Interferon gamma signaling, C) Interleukin 4 and 
Interleukin 13 signaling, D) Interleukin 6 signaling, E) Inactivation of CSF3 (G-CSF) 
signaling, and F) Interleukin 21 signaling. 


4. Discussion 


Integrating data from transcriptomics or other high-throughput systems, such as 
proteomics, metabolomics, and lipidomics, is expected to yield new insights. Unfortunately, it 
also introduces significant heterogeneity arising from various designs or methodologies, 
commonly known as batch effects. Batch effects are pervasive across all types of high- 
throughput biological platforms, including single measurement methods like PCR or ELISA.#4 
When performing a meta-analysis, batch effects may create bias and reduce statistical power, 
making it challenging to detect all relevant features, especially those with small effect sizes or 
in unbalanced samples.t On the other hand, integrating several smaller datasets theoretically 
improves statistical power, provided that technical heterogeneity, including batch effects, is 
effectively resolved. 


Efforts to mitigate batch effects have been proposed, as they are known to interfere with 
downstream statistical analysis, potentially introducing false significance between groups that 
only exist between batches without biological meaning.2243 Batch effects can also lead to the 
loss of biological signals contained in the data.*435 The proposed quantile transform approach 
tends to be respectful of each dataset's characteristics, and by mapping each variable's 
probability in a probabilistic graphical model, it can handle variables present in the metadata, 
such as group allocation, clinical data, and dichotomous variables, which can be added and 
probabilistically related to each other.2® To achieve this, we evaluated four distinct Cystic 
Fibrosis Datasets with CF genotype, modulator therapy, and different types of infection, 
incorporating gene expression with these variables, without applying any batch correction 
while respecting each dataset's individuality. This approach has demonstrated a high level of 
accuracy in classifying cancer types when applied to expression datasets. 37 
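
As an illustration of the quantile-based discretization discussed above, the following minimal Python sketch bins each gene's log CPM values within a single dataset using that dataset's own quartile cut points, and then joins the binned values with per-sample metadata. The number of bins, the per-gene binning, the toy values, and all function and variable names are assumptions made for illustration; they are not taken from the analysis code used in this study.

```python
from bisect import bisect_right
from statistics import quantiles


def quantile_discretize(values, n_bins=4):
    """Assign each value a bin index (0 .. n_bins-1) from the sample's own quantiles."""
    cuts = quantiles(values, n=n_bins, method="inclusive")
    return [bisect_right(cuts, v) for v in values]


def discretize_dataset(log_cpm, n_bins=4):
    """log_cpm maps gene -> list of values (one per sample) from ONE dataset."""
    return {gene: quantile_discretize(vals, n_bins) for gene, vals in log_cpm.items()}


def to_rows(binned, metadata):
    """Combine binned expression with per-sample metadata into one row per sample."""
    rows = []
    for i, meta in enumerate(metadata):
        row = dict(meta)
        for gene, bins in binned.items():
            row[gene] = bins[i]
        rows.append(row)
    return rows


# Two hypothetical datasets measured on different scales (values are made up).
dataset_a = {"GENE1": [1, 2, 3, 4, 5, 6, 7, 8]}
dataset_b = {"GENE1": [10, 20, 30, 40, 50, 60, 70, 80]}
binned_a = discretize_dataset(dataset_a)
binned_b = discretize_dataset(dataset_b)

# Hypothetical metadata: one dict per sample, in the same order as the values.
meta_a = [{"CF": i % 2 == 0, "virus": i < 4} for i in range(8)]
rows_a = to_rows(binned_a, meta_a)
```

Because each dataset is binned against its own distribution, the two toy datasets receive identical bin labels even though their raw scales differ tenfold; rows from several datasets can then be stacked into one table together with the metadata variables.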


To reduce processing time, we filtered the genes by selecting those that were differentially 
expressed in all datasets. For the Balloy dataset, we identified 350 differentially expressed 
genes (221 upregulated and 129 downregulated genes). In their original publication,!® the 
authors found a significantly higher number of upregulated genes than downregulated genes 
compared to noninfected control cells, although their comparisons were done at each time 
point. In our study, we pooled the controls and the Pseudomonas aeruginosa infection time 
point 0 as a control and compared them to Pseudomonas aeruginosa infection. In De Jong's study,!2 
the authors separated the cells by classes and made two different comparisons: virus infection 
versus controls and virus infections plus modulator with either Ivacaftor or 
Ivacaftor/Lumacaftor. We compared all cells together against the controls and identified 195 
upregulated genes and 60 downregulated genes. In the study by Sala et al., our comparisons 
were similar, with 639 and 2114 upregulated genes in the pilot and validation datasets, 
respectively, and 568 and 1834 downregulated genes, and 150 and 112 upregulated genes, and 
320 and 403 downregulated genes in our analysis, respectively. Differences can be noticed 
between the studies not only in how the comparisons were done, but also in the methods used 
for comparisons. In our study, all the analyses were performed with the DESeq2 package,” 
whereas Sala and De Jong's studies used edgeR.*8 


The pathway analysis performed by Balloy!® and Sala% did not use the same gene set. In 
our study, we used the Reactome gene set *°, and only De Jong! used the Reactome gene set as 
well. However, the inflammatory responses were similar in all studies. Sala's study 
associated the chaperone pathway with CF, while in our study, it was associated with the 
modulators. Other pathways, such as Interleukin 6, 9, and 21, were exclusively associated with 
CF in our analysis. The role of IL-6 is controversial; however, it participates in 
proinflammatory responses with TNF-α and interleukin-1β. IL-6 is a regulator of the host 
inflammatory response and is negatively associated with pulmonary function in chronic 
infection in CF and during acute exacerbation of respiratory symptoms or during a period of 
apparent clinical stability. In bronchoalveolar lavage fluid, IL-6 was significantly elevated in 
infants with CF.® Increased expression of IL-9 and IL-9R is responsible for mucus 
overproduction in the lung epithelium of patients with cystic fibrosis, and IL-21 is a 
multifunctional cytokine that acts on various immune cells.“ Interestingly, in mouse fibroblasts, 
IL-21R is expressed and upregulates matrix metalloproteinases in response to IL-21 by CD8+ 
T cells. 


When it comes to viral infection, we found only one pathway exclusively associated with 
viruses in our analysis, which is related to TGF-β signaling. This pathway is involved in 
pulmonary fibrosis and other organ-related processes. Viruses utilize various mechanisms to 
modulate this pathway, including altering the expression of the TGF-β protein and its receptors, 
as well as modulating the SMAD cascades; TGF-β leads to enhanced cell growth and induction of 
fibrosis. On the other hand, bacterial infection was not exclusively associated with any pathway 
in our analysis. As for the use of modulators, we identified three exclusive pathways: "Response of 
EIF2AK1 to heme deficiency," "late endosomal microautophagy," and "IL-1 signaling". The 
HRI kinase (or EIF2AK1) plays two main roles during development: it ensures a balanced 
synthesis of globin and heme and promotes the survival of erythroid precursors during iron 
deficiency. Inhibitors of P-gp (P-Glycoprotein) such as fostamatinib*® and Ivacaftor can be 
associated with various stress conditions, including oxidative stress, heme deficiency, osmotic 
shock, and heat shock.*! In the context of CF, the usage of modulators is associated with an 
autophagy pathway, which compromises CFTR recycling to lysosomal degradation.*® 
Moreover, in our study, the genes associated with modulators were linked to this pathway. In 
CF patients, CFTR modulators have been shown to increase airway nitric oxide (NO) by 
increasing the concentrations of IL-1α, IL-1β, and other Th17-associated cytokines in sputum, 
which is related to NO metabolism.” 


The overall pathway activation in all studies discovered by the Bayesian network approach 
in CF confirms previous studies describing a hyperinflammatory state in CF, as well as the 
participation of other pathways such as interleukin 4, 6, 13, and 21. Notably, interleukin 4 and 
13 were not exclusively associated with CF status. The roles of IL-4 and IL-13 in the 
epithelium of CF patients share several biological properties, including chloride secretion. On 
the other hand, IL-4 inhibits antiviral immunity,*! and neutralization of IL-13 reduces death 
and disease severity in COVID-19 without affecting viral load, indicating an 
immunopathogenic role for this cytokine. Additionally, G-CSF and GM-CSF can induce 
elastase and MMP-9 release by neutrophils *3. Interestingly, all the genes presented in the 
pathway analysis were in the last quantile of expression in our dataset. The main limitation of 
this study is that it serves as the initial proof of concept for quantile discretization in the 
integration of raw datasets. A comparison with different methods should be conducted. 
Additionally, clinical non-numeric data were included in a single analysis. Therefore, this 
analysis must be interpreted carefully and should serve as a guide for future models aiming to 
integrate all datasets and variables in a similar manner. Unfortunately, this study was limited to 
using only four CF datasets due to the considerable challenge of aligning complete metadata, 
which encompasses treatment, genotype mutation profiling, and infection status. It is 
uncommon to find metadata with all these features available, and new studies using this 
approach must be conducted to assess its efficacy. Despite these limitations, this study sheds 
light on various biological processes related to CF, particularly concerning viral and bacterial 
infections, as well as the impact of modulators on epithelial cells within a single assessment, 
providing valuable insights into these complex interactions. 


5. Conclusion 


The analysis of integrated data remains a powerful hypothesis generation tool among data 
scientists. However, dealing with the heterogeneity of multiple datasets poses real challenges. 
In this study, we proposed a novel approach to integrate several datasets while respecting the 
unique characteristics of each individual dataset. By applying quantile transformation to 
multiple datasets with good-quality metadata and integrating them, we obtained biologically 
meaningful results that align with existing literature and established associations with other 
variables such as modulators, virus, and bacterial infections. Our analysis 
revealed an inflammatory signature in CF patients, with exclusive associations observed in the 
interleukin 4, 6, 13, and 21 pathways. Furthermore, we identified potential links between virus 
infections and the TGF-β pathway, as well as associations between modulators and pathways 
such as "Response of EIF2AK1 to heme deficiency," "late endosomal microautophagy," and 
"IL-1 signaling." These findings contribute to a better understanding of the complex 
interactions in CF and highlight potential targets for further research and development of new 
integration protocols. Nonetheless, additional studies employing this methodology are 
imperative to determine the extent to which this approach can uncover novel 
associations compared to traditional methods. 


Low- and high-level information analyses of transcriptome connecting 
endometrial-decidua-placental origin of preeclampsia subtypes: A preliminary study 


Background. The existing proposed pathogenesis for preeclampsia (PE) has been applied only to the 
early-onset subtype and does not consider pre-pregnancy and competing risks. We aimed to decipher PE 
subtypes by identifying the related transcriptome that represents endometrial maturation and histologic 
chorioamnionitis. Methods. We utilized eight arrays of mRNA expression for discovery (n=289), 
and another eight arrays for validation (n=352). Differentially expressed genes (DEGs) were 
overlapped between those of: (1) healthy samples from endometrium, decidua, and placenta, and 
placenta samples under histologic chorioamnionitis; and (2) placenta samples for each of the 
subtypes. The subtypes were all possible combinations based on four axes: (1) pregnancy-induced 
hypertension; (2) placental dysfunction-related diseases (e.g., fetal growth restriction [FGR]); (3) 
onset; and (4) severity. Results. The DEGs of endometrium at the late-secretory phase, but none of 
decidua, significantly overlapped with those of any subtypes with: (1) early onset (p-values <0.008); 
(2) severe hypertension and proteinuria (p-values <0.042); or (3) chronic hypertension and/or severe 
PE with FGR (p-values <0.042). Although they significantly overlapped with the DEGs of the same 
subtypes, the genes were mostly counter-expressed in placenta under 
chorioamnionitis (n=13/18, 72.22%; odds ratio [OR] upper bounds <0.21) but co-expressed in 
late-secretory endometrium (n=3/9, 66.67%; OR lower bounds >1.17). Neither the first- nor the 
second-trimester placental DEGs under normotensive pregnancy significantly overlapped with those under 
late-onset, severe PE without FGR. Conclusions. We identified the transcriptome of endometrial 
maturation in placental dysfunction that distinguished early- and late-onset PE, and indicated 
chorioamnionitis as a PE competing risk. This study implies that it is feasible to develop and validate 
pathogenesis models that include pre-pregnancy and competing risks, in order to decide whether 
prospective data for PE need to be collected starting from pre-pregnancy, including chorioamnionitis 
information. 


1. Introduction 


Preeclampsia (PE) is one of the pregnancy-induced hypertension (PIH) subtypes related to placental and 
endothelial dysfunction [1, 2]. This disease makes survivors more susceptible to cardiovascular 
diseases later in life [3]. Many studies have proposed pathogeneses for PE [4]. Most of these were 
typical for the early-onset subtype and shared with those of fetal growth restriction (FGR) [5], 
as both PE and FGR are placenta dysfunction-related diseases (PDDs). Meanwhile, the 
early-onset subtype only contributed to <30% of PE cases [6]. Therefore, despite the numerous 
proposed pathogeneses, the common etiology for most of the PE subtypes is still unclear. 

Worldwide, PE affected 3-8% of pregnant women [7] and contributed to 11-18% of maternal deaths. 
The risk of hypertension later in life increased 3.7 times for women with a history of PE, and the 
onset was 7.7 years earlier for women with a history of PIH, compared to women without PE or PIH, 
respectively [8]. Hypertension contributed to all-cause morbidities and mortalities in one fourth of 
adults worldwide, although it is a modifiable risk factor [9]. This disease was more common in 
postmenopausal women compared to either men or premenopausal women, and only 50% 
had controlled their blood pressure despite good awareness of the necessary medications [10]. In 
addition, since the only cure is early delivery, PE was also the major contributor to prematurity and 
low-birth-weight infants [11], which led to neonatal deaths [12]. Preterm infants increased 
neonatal intensive care unit utilization, and this was not reduced among infants born to preeclamptic 
mothers given preventive intervention using low-dose aspirin at 11-13 weeks’ gestation. Infants born 
to preeclamptic mothers also demonstrated signs of cardiac injury [13], which may increase the 
risk of cardiovascular diseases later in life. Therefore, PE prevention has several impacts on mother 
and child healthcare, including mortalities, morbidities, resource utilization, and cardiovascular 
diseases later in life. 

Improving prevention strategies for any subtype of PE requires an understanding of the 
pathogeneses. These were commonly believed to occur in the first trimester of pregnancy based on 
the timing of the most successful prediction [14]. Yet, most of the comparisons were made against 
those conducted in later trimesters without considering the pre-pregnancy period [15]. Numerous 
theories have been proposed for the first-trimester pathogenesis, culminating in the 2-stage theory [16, 
17]. This consists of two sequential dysfunctions in the placenta and endothelium. However, the cause 
of the pathophysiological derangement in the placenta was still unclear [5]. Antecedents of this event 
were revealed by associations between PE and either endometrial maturation [18] or metagenomic 
profiling of the placenta [19]. A significant number of differentially expressed genes (DEGs) 
overlapped between those from preeclamptic chorionic villus sampling (CVS) and those from 
pathological endometrium [20]. However, the overlapping DEGs were regulated in the same direction 
instead of opposite ones, indicating the likelihood of co-occurrence rather than a potential cause-effect 
relationship. Meanwhile, many publications describing an association between the microbiome and PE 
were proof-of-concept reviews instead of research articles. Ultimately, PE remains a vascular 
disease with unknown etiology. This study aimed to identify the transcriptome representing endometrial 
maturation and histologic chorioamnionitis, enriched by DEGs of the PE subtypes, using a microarray 
meta-analysis workflow at low- and high-level information. 


2. Materials and Methods 


2.1. Dataset integration 


A previous workflow for microarray dataset integration was applied [21]. We utilized 15 publicly 
accessible microarray experiments of mRNA expression (n=653). The datasets were queried in the Gene 
Expression Omnibus (GEO) and ArrayExpress databases. To understand how these datasets helped 
in achieving the objective of our study, it is important to describe the spatial and temporal contexts 
of the datasets and the conditions they represented (Figure 1). Our datasets covered the 
pre- to post-pregnancy (post-partum) periods. The pre-pregnancy period was represented by endometrial 
samples, while the pregnancy period was represented by either decidual (maternal side) or placental 
(fetal side) samples. The placental samples also represented the post-partum period in terms of the 
chronic/gestational hypertension phenotype, as defined by the original study. Gestational 
hypertension starts from 20 weeks’ gestation to 6 weeks after delivery, but this condition is 
considered chronic hypertension if the elevated blood pressure persists more than 6 weeks after 
delivery. Furthermore, our datasets also covered histologic chorioamnionitis, the PE subtypes, and 
hemolysis, elevated liver enzymes, and low platelets (HELLP) syndrome (mostly preceded by PE, but 
may occur without PE). 

We utilized datasets 1 to 8 as the DEG discovery sets (Figure 1; Tables S1 and S2), which 
were GSE4888 (n=27; dataset 1) and GSE6364 (n=37; dataset 2) for endometrium, E-MTAB-680 
(n=24; dataset 3) for decidua, GSE12767 (n=12; dataset 4) and GSE9984 (n=12; dataset 5) for CVS 
(i.e., the first-trimester placenta), and GSE75010 (n=157; dataset 6), GSE98224 (n=48; dataset 7), 
and GSE100415 (n=20; dataset 8) for third-trimester placenta of normotensive pregnant women and 
those with several subtypes of PE, other PIH, and other PDDs. The placental samples also included 
those with and without either histologic chorioamnionitis or HELLP syndrome, for which we included 
second-trimester placenta in addition to third-trimester placenta. All the discovery sets applied total 
RNA extraction. Endometrium datasets covered the proliferative and the early-, mid-, and late-secretory 
phases. These four phases, in that order, represented endometrial maturation. Meanwhile, decidua 
datasets consisted of ectopic pregnancy (implantation site outside the endometrium) without or with 
intermediate decidualization, and intrauterine pregnancy (implantation site inside the endometrium) 
with intermediate and confluent decidualization. These four conditions represented decidualization 
from the lowest to the highest degree. For endometrium datasets, we excluded subjects with 
endometriosis and ambiguous histology reading of endometrial phases. For placenta datasets, we 
only included subjects with phenotypes that fitted the group definitions (see 2.2 Group definition). 
There were overlapping samples between GSE75010 and GSE98224 (n=48), but the duplicates were 
removed. No additional eligibility criteria were applied for decidua and CVS datasets beyond 
those from the original datasets. 

For validation sets (Figure 1; Tables S1 and S2), we utilized datasets 9 to 15, which were 
GSE30186 (n=12; dataset 9) with GPL10558 Illumina HumanHT-12 V4.0 expression beadchip, 
GSE10588 (n=43; dataset 10) with GPL2986 ABI Human Genome Survey Microarray Version 2, 
GSE24129 (n=24; dataset 11) with GPL6244 Affymetrix Human Gene 1.0 ST Array (transcript 
[gene] version), GSE25906 (n=60; dataset 12) with GPL6102 Illumina human-6 v2.0 expression 
beadchip, GSE4707 (n=14; dataset 13) with GPL1708 Agilent-012391 Whole Human Genome 
Oligo Microarray G4112A (Feature Number version), GSE44711 (n=16; dataset 14) with 
GPL10558 Illumina HumanHT-12 V4.0 expression beadchip, and GSE128381 (n=183; dataset 15) 
with GPL17077 Agilent-039494 SurePrint G3 Human GE v2 8x60K Microarray 039381 (Probe 
Name version). All the validation sets also applied total RNA extraction. Since most of the platforms 
were different, several DEGs might not be included in both the discovery and validation sets 
corresponding to the same subtype. Thus, we only used the intersected genes among them. 


Figure 1. The spatial and temporal contexts of datasets in this study and the conditions they represented. 
HELLP, hemolysis, elevated liver enzymes, and low platelets. 


We did not apply data integration for the discovery sets. The experiments using third-trimester 
placenta were conducted on the same microarray platform, GPL6244 Affymetrix Human Gene 
1.0 ST Array (transcript [gene] version). Meanwhile, the remaining experiments using 
endometrium, decidua, and the first-trimester placenta were conducted on another platform, 
GPL570 Affymetrix Human Genome U133 Plus 2.0 Array. Therefore, we identified DEGs 
separately for each tissue that used the same platform (Figure 2A). 

Several pathogenesis models for PE subtypes will be developed in the extension of this 
preliminary study according to the discovery sets without merging experiments from different 
platforms. To get comparable gene expression between discovery and validation sets after 
determining the DEGs and before developing the models, we normalized the validation sets 
according to the quantile distribution of the discovery controls, as previously described [22]. 
Therefore, the expression values are centered to those of the discovery controls. To ensure the 
comparable expression is achieved, we conducted principal component analysis. The samples were 
not separated among the experiment groups (Figure S1); thus, we could use the validation sets. 
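
The reference-based quantile normalization described above can be sketched as follows. This is a minimal Python illustration, assuming that each validation sample is mapped rank by rank onto the sorted expression values of the discovery controls; the procedure in [22] may differ in detail, and the function names and toy values are hypothetical.

```python
def reference_targets(reference, n):
    """Return n values spread evenly through the sorted reference distribution."""
    ref = sorted(reference)
    if n == 1:
        return [ref[(len(ref) - 1) // 2]]
    targets = []
    for k in range(n):
        pos = k * (len(ref) - 1) / (n - 1)  # fractional index into ref
        lo = int(pos)
        hi = min(lo + 1, len(ref) - 1)
        frac = pos - lo
        # Linear interpolation between the two neighbouring reference values.
        targets.append(ref[lo] * (1 - frac) + ref[hi] * frac)
    return targets


def normalize_to_reference(sample, reference):
    """Replace each value in sample by the reference quantile of the same rank."""
    targets = reference_targets(reference, len(sample))
    order = sorted(range(len(sample)), key=lambda i: sample[i])
    normalized = [0.0] * len(sample)
    for rank, idx in enumerate(order):
        normalized[idx] = targets[rank]
    return normalized


discovery_controls = [1, 2, 3, 4, 5]  # hypothetical reference expression values
validation_sample = [30, 10, 20]      # hypothetical sample on a different scale
normalized = normalize_to_reference(validation_sample, discovery_controls)
```

The ranks within the validation sample are preserved while its values are moved onto the scale of the discovery controls, which is the property that a subsequent principal component analysis would be expected to confirm.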

For the downstream analysis, the transcripts were summarized into genes. The raw expression 
data were combined according to the group definition. Outliers were estimated according to relative 
log expression before normalization and hierarchical clustering of sample-to-sample distances after 
normalization [23]. After removing outliers, the raw expression data were background-corrected 
and normalized using robust multi-array average algorithm. Quality control was conducted by data 
visualization using boxplot, quantile-to-quantile plot, and the MA plot, and confounder 
identification by surrogate variable analysis. 


2.2. Group definition 


PE is a syndrome characterized by both chronic/gestational hypertension and gestational proteinuria. 
PE subtypes in these datasets were all possible combinations based on four axes (Figures 1 and 2B): 
(1) PIH; (2) PDDs; (3) onset; and (4) severity. By PIH, there were two PE subtypes: (1) PE (i.e., 
gestational hypertension with gestational proteinuria); and (2) superimposed PE (i.e., chronic 
hypertension with gestational proteinuria). By PDDs, there were PE subtypes without and with FGR. 
By onset, there were two PE subtypes: (1) early onset (<34 weeks’ gestation); and (2) late onset 
(≥34 weeks’ gestation). By severity, there were two PE subtypes: (1) mild-to-moderate (i.e., systolic 
and diastolic blood pressures [SBP/DBP] of 140/90 to <160/<110 mm Hg with proteinuria of 300 
to 2000 mg/24 h, and HELLP negative); and (2) severe (i.e., SBP/DBP of >160/110 mm Hg with 
proteinuria of >2000 mg/24 h, or HELLP positive). As comparators, we also included 
chronic/gestational hypertension without gestational proteinuria, whose subtypes were also 
defined by PDDs and onset but not severity axes. 
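
To make the four-axis group definition concrete, the following minimal Python sketch assigns a PE case to one label per axis. The thresholds are those stated above; how boundary readings are treated, the handling of cases where only one of SBP or DBP is elevated, the "unclassified" label, and all names are assumptions made for illustration, not the grouping code used in this study.

```python
from dataclasses import dataclass


@dataclass
class Pregnancy:
    sbp: float                  # systolic blood pressure, mm Hg
    dbp: float                  # diastolic blood pressure, mm Hg
    proteinuria: float          # mg per 24 h
    hellp: bool                 # HELLP syndrome present
    onset_weeks: float          # gestational age at onset, weeks
    chronic_hypertension: bool  # True -> superimposed PE
    fgr: bool                   # fetal growth restriction present


def pih_axis(p):
    return "superimposed PE" if p.chronic_hypertension else "PE"


def pdd_axis(p):
    return "with FGR" if p.fgr else "without FGR"


def onset_axis(p):
    return "early onset" if p.onset_weeks < 34 else "late onset"


def severity_axis(p):
    # Assumption: a reading on the 160/110 boundary, or with only one of SBP/DBP
    # at or above it, counts as severe-range blood pressure.
    severe_bp = p.sbp >= 160 or p.dbp >= 110
    if p.hellp or (severe_bp and p.proteinuria > 2000):
        return "severe"
    mild_bp = (p.sbp >= 140 or p.dbp >= 90) and not severe_bp
    if mild_bp and 300 <= p.proteinuria <= 2000:
        return "mild-to-moderate"
    return "unclassified"  # combinations that the definition does not cover


def pe_subtype(p):
    """Return one label per axis: (PIH, PDDs, onset, severity)."""
    return (pih_axis(p), pdd_axis(p), onset_axis(p), severity_axis(p))


case_a = Pregnancy(150, 95, 500, False, 32, False, True)
case_b = Pregnancy(170, 115, 2500, False, 36, True, False)
```

Returning a tuple with one label per axis keeps every combination of the four axes representable, which mirrors the statement that the subtypes were all possible combinations.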

Using the distribution variances of gene expressions from controls in each tissue dataset, power 
analysis was conducted to estimate sample size for differential expression analysis with multiple 
testing by Benjamini-Hochberg false discovery rate (FDR) [24]. We also conducted sample size 
estimation using the intersected genes among the discovery and validation sets for all possible group 
combinations. After considering the sample size estimation, the group combinations and the 
intersected genes were selected for the downstream analysis (Figure 2B). 


2.3. Transcriptome analysis 


Transcriptome analysis was conducted in two stages. In stage 1 (Figure 2B), we used low-level 
information in each dataset with the same experiment and platform to identify a gene set by differential 
expression analysis. In stage 2 (Figure 2C), high-level information was used across datasets with 
different experiments and platforms by gene set overlap analysis. 


Figure 2. The analytical pipeline: (A) Data preprocessing, including quality control, for each dataset; (B) 
Differential expression analysis for discovery sets and data integration for validation sets; (C) Gene-set 
overlap analysis. *, conducted for chorioamnionitis/HELLP in either second or third trimester; FGR, fetal 
growth restriction; HELLP, hemolysis, elevated liver enzymes, and low platelets; hyp., hypertension; N/A, 
not applicable; PDDs, placenta dysfunction-related diseases; PE, preeclampsia; PIH, pregnancy-induced 
hypertension. 


2.3.1. Low-level information analysis to identify differential expression 


In stage 1, we conducted differential expression analysis only within each grouping, using 
transcriptomic data from the same tissue type (i.e., placenta) and the same platform (Figures 2A and 
2B). Before differential expression analysis, we filtered out transcripts whose expression was below 
the 20th percentile. We then modelled transcript expression, applying moderated t-statistics with 
multiple-testing correction by the Benjamini-Hochberg method. Each comparison paired a subtype with 
controls from microarray data of the same platform. Differentially expressed transcripts were 
selected if the FDR was less than 5%. Up- and downregulated transcripts were determined by 
positive and negative log2 fold change, respectively. 
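
The selection rule above can be illustrated with a minimal Python sketch. It is not the authors' R code: it assumes that raw p-values and log2 fold changes have already been obtained (the moderated t-statistics themselves are not reimplemented here), and the function names and toy values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that each adjusted value is the
    # minimum of p * m / rank over all ranks at or above the current one.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def split_degs(results, fdr_cutoff=0.05):
    """Split (transcript, log2_fold_change, p_value) rows into up/down sets."""
    fdr = benjamini_hochberg([p for _, _, p in results])
    up, down = set(), set()
    for (name, log2fc, _), q in zip(results, fdr):
        if q >= fdr_cutoff:
            continue  # not differentially expressed at the chosen FDR
        if log2fc > 0:
            up.add(name)
        elif log2fc < 0:
            down.add(name)
    return up, down


# Hypothetical toy results: (transcript, log2 fold change, raw p-value).
toy = [("T1", 1.8, 0.001), ("T2", -2.1, 0.004), ("T3", 0.3, 0.03), ("T4", -0.2, 0.60)]
fdr_values = benjamini_hochberg([p for _, _, p in toy])
up, down = split_degs(toy)
```

With these toy values, T1 and T3 fall into the upregulated set and T2 into the downregulated set, while T4 is excluded by the 5% FDR cutoff.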


2.3.2. High-level information analysis to identify gene set overlap 


In stage 2, because the experiments for the associated factors and the subtypes were conducted on 
different platforms, we tested their association using high-level information by set operation, 
i.e., gene set overlap analysis. There were two approaches to overlapping a pair of gene sets 
before applying the Fisher test (Figure 2C): (1) non-regulation-specific overlapping for the 
region-wise Fisher test; and (2) regulation-specific overlapping for the head-to-head Fisher test. 
The region-wise Fisher test identified whether an overlap between a pair of gene sets was 
statistically significant or simply due to chance. This test was computed between the DEGs of the 
associated factor and those of the subtype, taking the total number of genes of interest into 
account. Meanwhile, the head-to-head Fisher test identified whether co- or counter-expression, 
indicated by overlaps between a pair of gene sets, was statistically significant or simply due to 
chance (undifferentiated). This test was computed between the up- and downregulated DEGs of the 
associated factor and those of the subtype. If the p-value was <0.05, an odds ratio (OR) >1 or <1 
indicated more overlapped DEGs with co-expression or counter-expression, respectively, than with 
the opposite regulation. This test determined the regulation-specific overlap of interest between 
the associated factor and the subtype. This approach could not identify the causes of the subtypes; 
however, the goal of this study was to gain insight into the possible causes, particularly those 
related to the pre-pregnancy period and the microbial community. For the downstream analysis, we 
selected only DEGs from the significant non-regulation-specific overlap of interest for each of the 
subtypes, regardless of the significance of the regulation-specific overlap. 
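
A minimal Python sketch of the two overlap tests follows. It is an illustration under stated assumptions rather than the authors' implementation: the layout of each 2x2 table, the two-sided alternative, and all function and gene names are assumptions. The odds ratio returned here is the simple sample odds ratio, which differs from the conditional maximum-likelihood estimate reported by R's fisher.test.

```python
from math import comb


def fisher_exact(a, b, c, d):
    """Two-sided Fisher exact test on the 2x2 table [[a, b], [c, d]].

    Returns (sample odds ratio, p-value). The p-value sums the hypergeometric
    probabilities of every table with the same margins that is no more
    probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x under fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    support = range(max(0, col1 - row2), min(row1, col1) + 1)
    p_value = sum(prob(x) for x in support if prob(x) <= p_obs * (1 + 1e-9))
    if b * c > 0:
        odds_ratio = (a * d) / (b * c)
    else:
        odds_ratio = float("inf") if a * d > 0 else float("nan")
    return odds_ratio, min(1.0, p_value)


def region_wise_table(factor_degs, subtype_degs, background):
    """2x2 counts for a non-regulation-specific overlap within a background."""
    both = len(factor_degs & subtype_degs)
    factor_only = len(factor_degs - subtype_degs)
    subtype_only = len(subtype_degs - factor_degs)
    neither = len(background - factor_degs - subtype_degs)
    return both, factor_only, subtype_only, neither


def head_to_head_table(factor_up, factor_down, subtype_up, subtype_down):
    """2x2 counts for a regulation-specific overlap (co- vs counter-expression)."""
    return (
        len(factor_up & subtype_up),      # co-expressed, both up
        len(factor_up & subtype_down),    # counter-expressed
        len(factor_down & subtype_up),    # counter-expressed
        len(factor_down & subtype_down),  # co-expressed, both down
    )


# Hypothetical gene identifiers, for illustration only.
background = {f"g{i}" for i in range(20)}
factor = {"g0", "g1", "g2", "g3", "g4", "g5"}
subtype = {"g0", "g1", "g2", "g3", "g6", "g7"}
region_or, region_p = fisher_exact(*region_wise_table(factor, subtype, background))

h2h_or, h2h_p = fisher_exact(
    *head_to_head_table({"g0", "g1", "g4"}, {"g2", "g3", "g5"},
                        {"g0", "g1", "g6"}, {"g2", "g3", "g7"})
)
```

In the head-to-head table, an odds ratio above 1 means that overlapped genes are mostly regulated in the same direction (co-expression), and an odds ratio below 1 means that they are mostly regulated in opposite directions (counter-expression).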


2.4. Code availability 


We used R 4.2.2. To synchronize all the package versions and their dependencies, we used 
Bioconductor 3.16. All analysis code is available at https://github.com/herdiantrisufriyana/pec. 


3. Results 


3.1. Sample characteristics 


From the publicly accessible microarray datasets collected for the associated factors, we described 
the phenotype characteristics (Table S3). Leiomyomata and other non-endometrial conditions were 
found in the endometrium datasets, but these lesions were outside the tissue of interest. For the 
placenta datasets, the first- and second-trimester samples were taken from pregnant women at mean 
gestational ages of 8.43 ± 0 and 16.43 ± 0 weeks, respectively, and were compared with samples 
taken at a mean of 41 ± 0 weeks’ gestation. Almost all the placenta samples with chorioamnionitis 
were taken from non-preeclamptic pregnant women without HELLP. Meanwhile, all the placenta samples 
with HELLP were taken from preeclamptic pregnant women without chorioamnionitis. 

The phenotype characteristics were also described for the subtypes (Table S4). No 
chorioamnionitis was found in placenta samples from pregnant women with PE, with or without 
FGR, regardless of the onset, severity, and previous hypertension (i.e., superimposed PE). The 
same held for the controls of all subtypes (i.e., normotensive pregnancy) and for the other 
PIH subtypes, except early-onset gestational hypertension. Pregnant women with HELLP were found 
only in the severe subtypes of PE, except superimposed, early-onset, severe PE with FGR. 

We conducted differential expression analysis within each experiment of the same tissue (Table 
S5) with 5118 background genes. These genes were the intersection across the microarray datasets 
with sufficient gene-wise sample size. Since we did not find a cohort including all the associated 
factors and the subtypes, we could only identify high-level associations by overlapping the gene 
sets derived from different cohorts (Figure 2). A role in the placental dysfunction of the PE 
subtypes was indicated for endometrial maturation but not decidualization (Figure 3), in addition 
to placentation. Opposite gene regulation was also indicated between the PE subtypes and 
chorioamnionitis but not between the PE subtypes and HELLP syndrome (Figure 4; Table S6). We 
also gained insights related to late-onset PE. 


3.2. Role in placental dysfunction of PE by endometrial maturation but not decidualization 


Endometrial maturation, particularly the late-secretory phase, showed a potential role in several 
subtypes. Significant overlaps were found between the DEGs of late-secretory endometrium and those 
of placenta under any subtype with: (1) early onset (p-values <0.008); (2) severe hypertension and 
proteinuria (p-values <0.042); or (3) chronic hypertension and/or severe PE with FGR (p-values 
<0.042). Among these overlaps, placenta under early-onset gestational hypertension also indicated 
significant counter-expression (OR 0.15, 95% confidence interval [CI] 0.04 to 0.44; p-value 
<0.001). Meanwhile, significant co-expression was identified if the subtypes fulfilled the criteria 
of: (1) early-onset, severe superimposed PE (OR 2.43, 95% CI 1.51 to 3.95; p-value <0.001); or (2) 
early-onset FGR with either severe PE (OR 1.72, 95% CI 1.17 to 2.54; p-value 0.005) or chronic 
hypertension only (OR 2.85, 95% CI 1.38 to 6.03; p-value 0.003) but not both (i.e., superimposed 
PE; OR 1.15, 95% CI 0.69 to 1.93; p-value >0.05). These findings implied that endometrial maturation 
plays a putative role by sharing the same up- and downregulated genes with placenta under those 
subtypes. Meanwhile, the opposite regulation would lead to gestational hypertension alone without 
affecting the early onset. In addition, unlike late-secretory endometrium, the overlap patterns 
between early- and mid-secretory endometrium and the subtypes were inconclusive. 

While decidualization is considered a process subsequent to endometrial maturation, our findings 
did not show its potential role in almost any of the subtypes, except late-onset FGR with chronic 
hypertension. A significant overlap was found only between the DEGs of intrauterine, confluent 
decidua and placenta under late-onset FGR with chronic hypertension, in which all the decidual 
DEGs were included in the placental ones (OR ∞, 95% CI 2.09 to ∞; p-value 0.025). Neither 
significant co- nor counter-expression was identified between both sets of DEGs. Nevertheless, 
since decidua is a pregnancy version of endometrium, we believe this tissue may connect 
endometrial maturation and placentation by a process that cannot be identified by gene expression. 

Figure 3. Transcriptome analysis by gene-set overlapping between the subtypes and the associated factors 
of: (A) Endometrial maturation; and (B) Decidualization. Significances of undifferentiated, co-, and counter- 
expression are respectively indicated by ~, +, and -. The color gradation represents the number of overlapped 
DEGs for undifferentiated expression and the number of either co- or counter-expressed DEGs. All non-grey 
tiles are significant for the non-regulation-specific overlapping. DEGs, differentially-expressed genes; 
FGR, fetal growth restriction; hyp., hypertension; PE, preeclampsia; N/A, not applicable. 

Figure 4. Transcriptome analysis by gene-set overlapping between the subtypes and the associated factors 
of: (A) Placentation; and (B) Comorbidity. Significances of undifferentiated, co-, and counter-expression 
are respectively indicated by ~, +, and -. The color gradation represents the number of overlapped DEGs for 
undifferentiated expression and the number of either co- or counter-expressed DEGs. All non-grey tiles are 
significant for the non-regulation-specific overlapping. DEGs, differentially-expressed genes; FGR, fetal 
growth restriction; HELLP, hemolysis, elevated liver enzymes, and low platelets; hyp., hypertension; PE, 
preeclampsia; N/A, not applicable. 


Several findings indeed implied a role of endometrial maturation in placentation. 
Significant overlaps were also found between the DEGs of first- and third-trimester placentas, 
respectively under normotensive pregnancy and any subtype with chronic hypertension and/or PE 
with FGR (p-values <0.035). However, these third-trimester placentas were not always under early 
onset or severe hypertension and proteinuria. This exception distinguished the gene expression of 
first-trimester placenta from that of late-secretory endometrium in terms of overlaps with placenta 
of the subtypes. The same overlap patterns also applied to second-trimester placenta, but without 
regulation-specific overlap. Meanwhile, significant co-expression (OR lower bounds 
>1.56) was identified in the aforementioned overlaps of first-trimester placenta. The role of 
endometrial maturation in placentation might be related to the impact of chronic hypertension and/or 
PE on fetal growth more than to onset and severity. 


3.3. Competing risk of chorioamnionitis and PE with opposite gene regulation 


Furthermore, endometrial maturation implied a putative role in differentiating PE from 
chorioamnionitis. Late-secretory endometrium and placenta under chorioamnionitis shared the same 
subtypes whose placental DEGs overlapped with their respective DEGs. However, the gene regulation 
was significantly counter-expressed in the majority of overlaps for chorioamnionitis (n=13/18, 
72.22%; OR upper bounds <0.21). The remaining overlaps were neither co- nor counter-expressed, 
except for third-trimester placentas under early-onset gestational hypertension. Their DEGs 
indicated significant co-expression with those under chorioamnionitis (OR ∞, 95% CI 157.41 to ∞; 
p-value <0.001) but counter-expression with those of late-secretory endometrium (OR 0.15, 95% CI 
0.04 to 0.44; p-value <0.001). These findings implied that PE and chorioamnionitis might involve 
different endometrial maturation. 


3.4. Role in HELLP syndrome by endometrial maturation 


Similar to placenta under chorioamnionitis, endometrial maturation also implied a putative role in 
differentiating PE from HELLP. Similar subtypes were also shared, whose placental DEGs 
overlapped with those of late-secretory endometrium and placenta under HELLP. However, there 
was an exception, i.e., early-onset gestational hypertension. The significant overlaps for HELLP 
were also found only in third-trimester (p-values <0.001) but not second-trimester placenta. This 
finding implied that HELLP syndrome might be a competing risk of PE if it occurs alone in an 
earlier trimester. The role of endometrial maturation in HELLP syndrome might be mediated by PE 
only. In addition, unlike chorioamnionitis, all the gene regulation was significantly co-expressed 
in the overlaps between the third-trimester placental DEGs of the shared subtypes and HELLP 
syndrome (OR lower bounds >12.79). These findings provide insight into differentiating between 
HELLP that occurs alone and HELLP that occurs with PE. 


3.5. Insights related to late-onset, severe PE 


Finally, it is important to point out the difference between PE subtypes with early and late 
onset. Most findings up to this point were related to PE subtypes with early onset. While we 
identified a significant overlap between the placental DEGs under late-onset, severe PE without FGR 
and the endometrial ones at the mid- (OR 1.7, 95% CI 1.21 to ∞; p-value 0.004) and late-secretory 
phases (OR 1.47, 95% CI 1.06 to ∞; p-value 0.026), we did not identify any significant overlaps 
between these DEGs and the placental DEGs under normotensive pregnancy. Meanwhile, 
differential expression analysis identified the placental DEGs for: (1) first- and second-trimester 
versus term placentas; and (2) late-onset, severe PE without FGR versus normotensive pregnancy. 
These findings implied that placentation did not play a role in late-onset PE without FGR but 
endometrial maturation did. However, significant overlaps were identified if this condition was 
accompanied by FGR (p-values <0.041), which indicated that placentation might still play a role in 
FGR under late-onset PE. These findings implied that impaired placentation after impaired 
endometrial maturation might lead to either an earlier PE or an impact of PE on fetal growth. 


4. Discussion 


4.1. Interpretation and comparison to previous works 


Transcriptome analysis indicated that the role in placental dysfunction of chronic hypertension 
and/or severe PE with FGR was potentially played by endometrial maturation but not 
decidualization, particularly by gene expression in late-secretory endometrium. The role of 
endometrial maturation via placentation for these subtypes was considered smaller in affecting their 
onset and severity. Meanwhile, only a role of endometrial maturation, and none of placentation, was 
indicated in late-onset, severe PE without FGR, distinguishing this subtype from the others. In 
addition, endometrial maturation was also indicated to play a role in HELLP syndrome as a 
condition subsequent to PE (e.g., during the third trimester) but not as its antecedent (e.g., 
during the second trimester). Furthermore, the decision to terminate the pregnancy was likely taken 
if preeclampsia was accompanied by HELLP, since most HELLP placentas were found in 
preeclamptic pregnant women without chorioamnionitis. Therefore, endometrial maturation alone 
in the pre-pregnancy period might play a putative role in PE pathogenesis. 

For histologic chorioamnionitis, its gene set also overlapped with those of chronic 
hypertension and/or severe PE with FGR. This was similar to late-secretory endometrium, but the 
gene regulation was the opposite. While endometrial maturation might also play a role in histologic 
chorioamnionitis, its gene expression was regulated in the opposite direction to that of those 
subtypes. This finding might be related to the phenotype data, which implied that chorioamnionitis 
was likely a competing risk of PE. Either PE or chorioamnionitis was probably diagnosed earlier, 
which in turn resulted in early termination before the onset of the other condition. Similarly, the 
competing risk approach has been used for predicting PE with satisfactory accuracy [25]. 
Furthermore, the histological changes of placenta in chorioamnionitis were only increased fetal 
capillaries without the villous remodeling observed in PE, but the changes were more acute [26], 
probably preceding PE. Hence, the gene regulation of endometrial maturation probably determined 
whether a pregnancy would end up with either PE or chorioamnionitis by affecting the microbial 
community in the endometrium. 

Furthermore, the role of endometrial maturation in the pre-pregnancy period was specific to PE 
among the PIH/PDDs and was not shared by, e.g., early-onset gestational hypertension. The latter 
shared similar characteristics with chorioamnionitis in terms of gene regulation, which was 
opposite to that of late-secretory endometrium. A previous study showed that gestational 
hypertension had the highest risk of acute chorioamnionitis (n=29/91, 31.9%; p-value <0.001) 
compared with the other PIH [27]. Nevertheless, what differentiates early-onset gestational 
hypertension from chorioamnionitis is still unclear. 


4.2. Strengths and limitations 


With our workflow, an expensive, time-consuming wet lab experiment can be prepared not 
only by literature review but also by a data-driven approach using either low- or high-level 
data. This preliminary study also demonstrated how off-the-shelf tools might be variably applied to 
answer diverse questions on a similar topic. This kind of secondary data analysis across different 
tissues and timing is inevitable, particularly in pregnancy-related research, because of ethical 
reasons. Similar situations may also apply to other conditions with long time intervals, in which 
primary data collection is difficult and expensive. 

This study has several limitations. Microarray gene expression datasets may not 
reveal all parts of the pathogenesis. Nevertheless, we differentiated several subtypes of 
PE and identified novel, data-driven pathways using the microarray datasets only. Additional 
information from next-generation sequencing data may reveal new perspectives on the proposed 
pathogenesis of PE in this study, by including non-coding genes. We also could not apply the results 
directly to develop screening and preventive strategies in clinical settings. This is because we used 
microarray data from tissues obtained by invasive sampling, unlike blood sampling or other methods 
that are routinely used in clinical settings. Yet, these results give a specific direction for the 
variables and the study design of the next investigation to support clinical implementation. 


5. Conclusions 


The role in placental dysfunction was potentially played by endometrial maturation, but not 
decidualization, for any subtypes with early onset, severe hypertension and proteinuria, or chronic 
hypertension and/or severe PE with FGR. However, no role of placentation was indicated in 
late-onset, severe PE without FGR. Both phenotype and genotype also implied that histologic 
chorioamnionitis was likely a competing risk of PE, in which the gene regulation of endometrial 
maturation might affect the surrounding microbial community to determine whether a pregnancy ends 
up with either PE or chorioamnionitis. In addition, our preliminary results showed the feasibility 
of developing and validating pathogenesis models of PE subtypes, which will be the focus of the 
extension of this preliminary study. Ultimately, this study will help to decide whether future 
studies need prospective, pre-pregnancy and chorioamnionitis data for preeclampsia. 


The problem of microdissection of heterogeneous tissue samples is of great interest for both 
fundamental biology and biomedical research. Until now, microdissection in the form of 
supervised deconvolution of mixed sequencing samples has been limited to assays measuring 
gene expression (RNA-seq) or chromatin accessibility (ATAC-seq). We present here the first 
attempt at solving the supervised deconvolution problem for run-on nascent sequencing data 
(GRO-seq and PRO-seq), a readout of active transcription. Then, we develop a novel filtering 
method suited to the mixed set of promoter and enhancer regions provided by nascent 
sequencing, and apply best-practice standards from the RNA-seq literature, using in-silico 
mixtures of cells. Using these methods, we find that enhancer RNAs are highly informative 
features for supervised deconvolution. In most cases, simple deconvolution methods perform 
better than more complex ones for solving the nascent deconvolution problem. Furthermore, 
undifferentiated cell types confound deconvolution of nascent sequencing data, likely as a 
consequence of transcriptional activity over the highly open chromatin regions of 
undifferentiated cell types. Our results suggest that while the problem of nascent deconvolution 
is generally tractable, stronger approaches integrating other sequencing protocols may be 
required to solve mixtures containing undifferentiated celltypes. 


1. Introduction 


One key problem of interest when studying transcription is the ability to capture the 
heterogeneity that exists in true biological samples.¹ Bulk sequencing samples from cells are an 
aggregate across a cellular population, and thus average out differences between individual 
cells to capture only an ensemble profile of a given sample. Notably in the case of samples 
taken from tissues composed of heterogeneous constituent cells, any celltype specific differences 
are not necessarily discernible in the heterogeneous mixture of expression data. 

This problem has been partially solved in the context of RNA-seq 
with the emergence of single cell RNA-seq protocols, which allow RNA content to be measured at the 
level of the individual cell.² However, the relatively high cost of sampling deeply 
limits the use of scRNA-seq in many contexts. Consequently, a great deal of work has been 
done to separate samples into constituent cell types in silico. This task is interchangeably 
referred to as deconvolution or microdissection. Deconvolution has been studied extensively in 
the context of both microarray data and in RNA-seq,!? © but has seen only limited application 
to other high throughput genomic data. 

Nascent transcription protocols’® are of particular interest for studies into transcriptional 
regulation. !° Nascent sequencing protocols profile active RNA Polymerase II activity, which 
captures enhancer associated RNAs (eRNAs), short unstable transcripts that are often 
associated with transcription factor binding sites.¹¹ These eRNA transcripts have proven to be 
highly informative markers of transcription factor activity.?!°!2 1° Unfortunately, RNA-seq, 
whether bulk or single cell, does not capture enhancer associated transcripts because 
they are unstable and not polyadenylated." For this reason, the theoretical possibility of single 
cell measures of nascent transcription has tremendous potential for understanding regulation 
and transcription factor activity in key biological processes including development and disease 
progression. 

Today, nascent sequencing protocols still operate only on the bulk level, largely because 
nascent protocols are relatively onerous, taking up to a week to process a set of samples.”>!7 
Because nascent protocols capture RNA production, many of the signals arise from lowly 
abundant, highly unstable RNAs.¹¹ Furthermore, with current biochemical efficiencies, a single 
cell nascent sequencing protocol is likely infeasible, and thus deconvolution is needed to dissect 
nascent transcription profiles within tissues. 

Nascent transcription data has distinctive properties compared to RNA-seq. First, 
RNA-seq measures steady state mature, stable RNA levels which tend to be of relatively high 
abundance. In contrast, nascent sequencing protocols cover a much larger proportion of the 
genome (~40% as opposed to ~8%).¹⁷ The consequence is that the average sequencing depth 
per transcript is typically lower in nascent data, in spite of often sequencing samples to a 
higher depth. Second, many transcripts measured in nascent protocols are unannotated, lowly 
transcribed, unstable eRNAs (Figure 1A).¹⁴,¹⁷ In development, enhancer activities are the first 
changes detectable when a cell undergoes state change, suggesting their associated eRNAs 
have high potential as cell type markers.¹⁸ Furthermore, enhancer associated RNAs tend to be 
more cell type specific than protein coding genes.! However, their low transcription levels lead 
to issues of reliable detection.'’ Thus methods developed for RNA-seq must be appropriately 
adapted for use with nascent sequencing data. 

Here, we apply standardized methods for supervised deconvolution to nascent sequencing data, 
using a newly developed filtering technique to solve problems presented by nascent data 
in the deconvolution context. We show that deconvolution of nascent sequencing data works 
reliably, albeit with different model performance than in RNA-seq. We find that eRNAs provide 
an informative set of features for deconvolution that can be inferred without a reference 
annotation. Furthermore, we find that undifferentiated celltypes confound deconvolution of 
nascent sequencing data, likely because their transcriptional expression resembles that of an 
aggregate of different differentiated celltypes. 


2. Results 


The problem of supervised deconvolution with sequencing data is formulated as follows: Given 
sequencing samples from homogeneous cell types and a heterogeneous sample made up of those cell 
types, can we estimate the mixing proportions of those constituent cell types? The problem of 
supervised (or partial) deconvolution is typically formulated as a linear system (Equation 1).⁵,²⁰ 


    X = AS        (1) 


Here, X is a single-row matrix with one column per region of interest (ROI) (1 × g), A is a 
single-row matrix with one column per reference homogeneous cell type (1 × s), and S is a matrix 
with one row per sample and one column per ROI (s × g). In most contexts, regions of interest 
(ROIs) correspond to annotated genes. 
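
As a small illustration of Equation 1, the following Python sketch computes a mixture X from hypothetical mixing proportions A and a hypothetical reference matrix S. The function name and all counts are invented for illustration; real ROI counts would come from the reference samples.

```python
def mix(proportions, reference):
    """Compute X = A S for a 1 x s row A and an s x g matrix S (nested lists)."""
    n_rois = len(reference[0])
    return [
        sum(a * row[j] for a, row in zip(proportions, reference))
        for j in range(n_rois)
    ]


# Hypothetical read counts: 2 reference cell types (rows) x 4 ROIs (columns).
S = [[100.0, 0.0, 40.0, 10.0],
     [0.0, 80.0, 40.0, 30.0]]
A = [0.25, 0.75]  # mixing proportions, one per reference cell type
X = mix(A, S)     # expected signal per ROI in the mixed sample
```

Supervised deconvolution runs this calculation in reverse: X and S are observed, and A is the unknown.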

This is an overdetermined linear system, since the number of ROIs far exceeds the number 
of constituent cell types. Additionally, because these are biological values sampled from a 
noisy process, the key challenge is minimizing errors when solving the system. Most work 
in the literature has sought to solve the issues of this system in the context of RNA-seq or 
microarray! ©7022 data, with limited applications of this approach to other kinds of sequencing 
data. 

For RNA-seq, a large variety of tools and approaches have been developed,!?°°:620-22 
which approach the problem using different models, constraints, and regularization approaches, 
as well as different ways to shrink the linear system. Many of these approaches claim to 
be the state-of-the-art, with most tools providing good performance. Consequently, we first 
examine the deconvolution problem on nascent sequencing using annotated genes and methods 
developed for RNA-seq. 


2.1. Deconvolution on annotated genes 


To evaluate existing deconvolution methods on nascent sequencing data, we first identified a 
number of high quality nascent sequencing data sets from a variety of cell types (see Table 
2.1). Samples were processed using a standardized analysis pipeline²³ which includes quality 
control, mapping and bidirectional transcript identification. These bidirectional transcripts 
originate from both gene start sites and regulatory elements such as enhancers (Figure 1A). 
The non-gene associated bidirectionals are often referred to as enhancer associated RNAs, or 
eRNAs. 

As a first test, we examined only annotated protein coding genes to mimic deconvolution 
analysis typically done in RNA-seq. Notably, nascent data differs from RNA-seq in that splicing 
information is not present in nascent sequencing experiments, as RNA is collected pre-splicing. 
Furthermore, consistent with standards in nascent transcription analysis,¹⁷ we exclude the 
+300 initiation region of each gene when using featureCounts²⁴ to count reads (see Figure 1A), 
as this avoids the 5’ bidirectional peak. 

Study          Ref.  GEO Accession  SRR          Cell Type 
Samples used in Figure 
Jiang 2018     26    GSM3025555     SRR6789175   HCT116 
Fei 2018       27    GSM3100195     SRR7010982   HeLa 
Andrysik 2017  28    GSM2296635     SRR4090102   MCF7 
Dukler 2017    29    GSM2545324     SRR5364303   K562 
Zhao 2016      30    GSM2212033     SRR3713700   Kasumi-1 
Danko 2018     31    GSM3021718     SRR6780907   CD4+-T-cell 
Chu 2018       32    GSM3309955     SRR7616132   Jurkat-T-cell 
Samples added for 
Core 2014      33    GSM1480326     SRR1552485   GM12878 
Smith 2021     34    GSM4214080     SRR10669536  ESC 
Ikegami 2020   35    GSM4207079     SRR10601203  BJ5ta 

Table 2.1. Samples used in this study. 

To simulate a mixed sample, we generated 128 randomly mixed samples by subsampling 
reads from each reference sample. Samples used for all in-silico experiments in this paper were 
mixed proportionally from raw reads using samtools,²⁵ and are listed in Table 2.1. With these 
randomly mixed samples, we then performed supervised deconvolution using 4 different methods 
which are commonly discussed in the literature: Nonnegative Least Squares Regression 
(NNLS), Ridge Regression, LASSO Regression, and ε-Support Vector Regression (SVR). For 
all methods tested, we apply a nonnegativity constraint (all mixing proportions must be 
at least zero) and a sum-to-one constraint (all mixing proportions must sum to one), as 
suggested in prior work.¹ These constraints serve to make results from various deconvolution 
procedures interpretable as mixing weights for the linear deconvolution system. Code and 
supplemental materials for this project are available at https://github.com/Dowell-Lab. 
We find that these methods provide generally good accuracy on 
deconvolution of our 128 randomly generated mixtures, although regularized 
methods appear to perform more poorly than naive NNLS (Figure 1B,C) in certain celltypes across 
these mixtures. In this context, regularization does not appear to improve accuracy, and it comes 
at the cost of significant computational slowdowns relative to NNLS. Given these promising initial 
results, we next sought to shift the focus away from annotated genes to the unannotated 
bidirectional transcripts present at both promoters and enhancers. 
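
The two constraints described above can be illustrated with a minimal Python sketch. It is not one of the four solvers evaluated in the paper (NNLS, Ridge, LASSO, SVR): it solves the same least-squares problem by projected gradient descent, a swapped-in technique chosen because it enforces nonnegativity and sum-to-one directly. The function names, step size rule, iteration count, and toy counts are all assumptions.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {a : a >= 0, sum(a) == 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, value in enumerate(u, start=1):
        cumulative += value
        t = (cumulative - 1.0) / k
        if value - t > 0:
            theta = t  # keep the threshold from the last index that qualifies
    return [max(x - theta, 0.0) for x in v]


def deconvolve(mixture, reference, n_iter=500):
    """Estimate mixing proportions A minimizing ||A S - X||^2 on the simplex."""
    s, g = len(reference), len(mixture)
    # The squared Frobenius norm of S bounds the largest eigenvalue of S S^T,
    # so its inverse is a safe (if conservative) gradient step size.
    step = 1.0 / sum(x * x for row in reference for x in row)
    a = [1.0 / s] * s  # start from equal proportions
    for _ in range(n_iter):
        fitted = [sum(a[i] * reference[i][j] for i in range(s)) for j in range(g)]
        residual = [f - x for f, x in zip(fitted, mixture)]
        grad = [sum(r * ref for r, ref in zip(residual, reference[i])) for i in range(s)]
        a = project_to_simplex([ai - step * gi for ai, gi in zip(a, grad)])
    return a


# Hypothetical counts: 2 reference cell types x 4 ROIs, mixed at 25% / 75%.
S = [[100.0, 0.0, 40.0, 10.0],
     [0.0, 80.0, 40.0, 30.0]]
X = [25.0, 60.0, 40.0, 25.0]  # equals 0.25 * S[0] + 0.75 * S[1]
estimate = deconvolve(X, S)
```

On this noise-free toy mixture the estimate converges to the true proportions. With real counts the residual would not vanish, and the choice of solver and regularization would matter, as the comparison in the text indicates.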

Figure 1. A: Nascent transcripts accumulate in a known bidirectional pattern around promoter sites 
as well as at enhancers.”°° These bidirectional regions are counted by convention around ±300bp from 
the site of RNA Polymerase initiation (roughly the center of the bidirectional).91°°° For annotated 
genes, we exclude the initiation peak by counting +300 to the annotated transcription end site. 
B: Deconvolution was performed on random mixtures of cells from Table 2.1. Some 
celltypes show highly accurate estimation of mixing proportion when doing deconvolution over all 
annotated genes, with most methods showing good linearity in their estimation. C: Other celltypes 
confound the regularized models used here, suggesting a systematic failure of regularization for proper 
estimation of mixing proportions in this naive analysis. This failure appears to be more pronounced 
with L2 regularized methods and appears in all analyses conducted in this work, to some extent. 


2.2. Identifying bidirectionals as regions of interest 


In addition to transcription at annotated genes, nascent transcription data contains 
bidirectional transcription at both promoters and regulatory elements. While annotated genes are 
widely studied and the typical target for this class of deconvolution algorithms, the study of 
enhancer associated RNAs is important for understanding the regulatory landscape of the cell. 
Various methods exist to identify sites of bidirectional transcription®® °° and to combine them 
across different samples.!° As such, bidirectionals are an additional region of interest that we 
now consider in our deconvolution framework. 

To this end, we use a combined set of 485,688 bidirectionals, identified by Tfit and dReg 
within the Nascent-flow framework, capturing both enhancer RNAs and promoter regions, for 
all samples in Table 2.13638 Notably, this system is significantly larger than the set of protein 
coding genes (approximately 490,000 vs 20,000). In this work, we use the following terminology 
in reference to subsets of this system: bidirectionals refers to any site of RNA polymerase 
II initiation and generally includes both promoters and enhancers; any bidirectional that 
overlaps the 5’ end of an annotated gene (+/-300bp of the annotated TSS) in the RefSeq hg38 annotation 
is called a promoter; all other bidirectionals are called enhancers. Given the large size of this 
system, we next turn our attention to filtering the set of bidirectionals, to shrink the size of 
the overdetermined system to make deconvolution more computationally feasible. 


2.3. Filtering methods are useful for shrinking the system 


In traditional deconvolution contexts like microarray and RNA-seq, patterns of differential 
expression are often leveraged to shrink the system. For example, CIBERSORT® uses an 
adaptive filtering method based on DESeq2 to find genes most indicative of specific celltypes. 
In the context of nascent sequencing data, however, tools like DESeq2 are problematic. The 
relatively low read coverage and cell type specificity of bidirectionals (i.e., inherent variability) 
lead DESeq2 to distrust these regions. To counter this, we developed a naive filtering scheme 
that selects a user-defined number of ROIs for each homogeneous reference sample, choosing the 
ROIs where the reads for that sample differ most from those of all other samples. More 
formally, we define an algorithm for pruning the system of ROIs to a tractable level: 


• Filter all ROIs to restrict them to regions where all celltypes have counts lower than 
  the 99th percentile of reads in the sample. We do this to remove outliers whose extreme 
  values could break the assumptions of a linear system. 

• Generate transformed ratio T such that for each ROI (row), for each celltype (column), 
  that entry is the log2 ratio of the count at that ROI over the maximum count for that 
  ROI not in that celltype. This step generates a log2 transformed list of the ROIs that 
  are the most specific to a single celltype. 

• Order this list by the largest log2 ratio in any celltype in any ROI. Then, walk down this 
  list keeping ROIs such that the number of ROIs for each celltype is approximately equal, 
  up to some limit of elements. This generates a subset of the full system with the most 
  celltype specific elements for each cell. The number of ROIs is approximate because the 
  number of celltype specific elements varies per-celltype and can be exhausted at larger 
  system sizes. 
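
The three filtering steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function name `filter_rois`, the pseudocount added before taking logs, and the rule that an ROI only counts toward a celltype when its log2 ratio is positive are all our own choices for the example.

```python
import numpy as np

def filter_rois(counts, n_per_celltype, pseudocount=1.0):
    """Pick the most celltype-specific ROIs from a reference count matrix.

    counts: (n_rois, n_celltypes) read counts from homogeneous reference samples.
    n_per_celltype: upper limit of ROIs kept for each celltype.
    pseudocount: assumed constant that avoids division by zero and log of zero.
    Returns (selected ROI indices, number of ROIs kept per celltype).
    """
    counts = np.asarray(counts, dtype=float)
    n_types = counts.shape[1]

    # Step 1: keep ROIs whose counts are below the 99th percentile in every sample.
    cutoffs = np.percentile(counts, 99, axis=0)
    keep = np.all(counts < cutoffs, axis=1)
    kept_idx = np.flatnonzero(keep)
    c = counts[keep] + pseudocount

    # Step 2: log2 ratio of each entry over the largest count among the other celltypes.
    ratios = np.empty_like(c)
    for j in range(n_types):
        others = np.delete(c, j, axis=1).max(axis=1)
        ratios[:, j] = np.log2(c[:, j] / others)

    # Step 3: walk down the ROIs ordered by their best ratio, balancing celltypes.
    best_type = ratios.argmax(axis=1)
    best_ratio = ratios.max(axis=1)
    taken = np.zeros(n_types, dtype=int)
    selected = []
    for i in np.argsort(-best_ratio):
        j = best_type[i]
        # Assumption: only ROIs enriched in a celltype (ratio > 0) count toward it.
        if best_ratio[i] > 0 and taken[j] < n_per_celltype:
            selected.append(kept_idx[i])
            taken[j] += 1
    return np.array(selected, dtype=int), taken
```

Because the walk stops adding ROIs for a celltype once its quota is full or its enriched ROIs run out, the per-celltype totals returned in `taken` can be unequal, which mirrors the approximate balancing described above.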


2.4. Most linear methods perform with high accuracy on synthetic nascent data 


Given that bidirectional regions have distinct transcription characteristics compared to more 
robustly transcribed annotated genes, we first sought to assess deconvolution methods on the 
filtered bidirectional set. Using this set, we find that deconvolution achieves a high degree 
of accuracy (Figure 2A). Unexpectedly, we observe that across all sizes of system tested 
(including systems far in excess of the total number of genes in the human genome), non-negative 
least squares (NNLS) regression performs with the highest degree of accuracy. LASSO 
(L1 regularized linear regression) is a close second in performance. This is likely because 
LASSO regularization will only drop out cell types that are unlikely to be present in the mixture. 
In contrast, Ridge Regression (L2 regularized linear regression) performs worse than all other 
tested methods for most system sizes. Similarly, ε-Support Vector Regression (ε-SVR) with L2 
regularization also performs relatively poorly compared to NNLS, but relatively well compared 
to Ridge regression. Despite these differences in accuracy, all models perform reasonably well 
on our synthetic mixtures, achieving accuracy to within a few percent on randomized mixtures. 
This is notable because these deconvolution methods perform well both on systems much 
smaller and much larger than those typically used for deconvolution of RNA-seq data. 
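
To make the deconvolution setup concrete, the following sketch mixes synthetic reference profiles with known proportions and recovers them with NNLS. It is illustrative only: the reference matrix is random rather than real nascent data, and the solver is a simple projected gradient descent that we substitute for a library routine (in practice a call such as `scipy.optimize.nnls` would be the usual choice). All variable names are hypothetical.

```python
import numpy as np

def nnls_projected_gradient(A, b, n_iter=2000):
    """Minimize ||A x - b||^2 subject to x >= 0 by projected gradient descent."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    # Step size 1/L, where L is the largest eigenvalue of A^T A.
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)
        x = np.maximum(x - step * grad, 0.0)  # project back onto x >= 0
    return x

# Synthetic reference: rows are ROIs, columns are celltypes (assumed toy data).
rng = np.random.default_rng(0)
n_rois, n_types = 1000, 5
reference = rng.exponential(scale=10.0, size=(n_rois, n_types))

# Known mixing proportions; the last two celltypes are absent from the mixture.
true_props = np.array([0.5, 0.3, 0.2, 0.0, 0.0])
mixture = reference @ true_props

estimate = nnls_projected_gradient(reference, mixture)
estimate = estimate / estimate.sum()  # rescale so proportions sum to one
```

In this noise-free toy case the non-negativity constraint alone is enough to assign the absent celltypes a proportion of essentially zero, without any regularization term.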

Interestingly, we find our subsetting method consistently selects a mixture of enhancers and 
promoters that does not significantly differ from the distribution expected by random chance 
(Figure 2B). Consequently, this procedure captures mostly eRNAs and not promoters, since 
eRNAs far outnumber promoters. This suggests that certain 
enhancer-driven regulatory elements are highly informative in identifying celltype. 
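
The comparison against random selection can be illustrated with a hypergeometric tail probability and a Bonferroni correction. The sketch below uses exact integer arithmetic from the standard library, which is practical only for small illustrative inputs; the function names and the one-sided (enrichment) form of the test are assumptions made for the example.

```python
from math import comb

def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) when drawing `draws` items without replacement from `population`
    items, of which `successes` are of the type of interest."""
    total = comb(population, draws)
    upper = min(successes, draws)
    tail = sum(comb(successes, i) * comb(population - successes, draws - i)
               for i in range(k, upper + 1))
    return tail / total

def enhancer_enrichment_p(n_selected_enhancers, n_selected, n_enhancers, n_rois, n_tests=1):
    """Bonferroni-corrected p-value that enhancers are over-represented in a subset."""
    p = hypergeom_sf(n_selected_enhancers, n_rois, n_enhancers, n_selected)
    return min(1.0, p * n_tests)
```

For example, `enhancer_enrichment_p(3, 3, 5, 10)` asks how likely it is to draw three enhancers in three picks from ten ROIs of which five are enhancers.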

We next sought to determine which ROIs were most informative to the deconvolution 
problem. To answer this question, we utilized NNLS, the best performing method in our prior 
tests. Using NNLS, we compared the performance on bidirectionals (as in Figure 2A), to 
annotated genes (as in Figure 1B,C) and a combination of these features, selected using 
our region filtering approach (Figure 3). We find that these methods achieve high accuracy 
for both genes and bidirectionals across a number of system sizes, with somewhat reduced 
accuracy when combining these two sets of ROIs. This reduction in accuracy could be a result 
of collinearity in the combined set of ROIs, as some bidirectionals may be intronic and thus 
they are not a strictly non-overlapping set relative to annotated genes. 

For the data tested and the size of system used, we found that certain methods in the 
literature were prohibitively slow for the large linear systems we tested. For example, a ν-support 
vector regression (ν-SVR) approach as suggested by CIBERSORT* was too computationally 
expensive to test or benchmark reliably, taking more than 24 hours to do deconvolution on a 
single mixture of cells at large system sizes (approximately 100k ROIs or more). Due to these 
poor scaling characteristics, we instead chose to use an optimized implementation of the primal 
version of ε-SVR. This was chosen instead of a dual formulation to maintain computational 
tractability for the large number of samples relative to the number of features. In the context 
of nascent sequencing data, NNLS is likely the best model to use based on our benchmarking. 


2.5. Undifferentiated celltypes confound deconvolution of mixtures 


In the course of testing our model, we observed that certain celltypes strongly confounded all 
deconvolution models tested when using bidirectionals. To understand this puzzling behavior, 
we examined deconvolution in the presence and absence of these cell types. To do so, 
we generated a titration curve, varying the mixing proportions of all celltypes from maximally 
separated values to equal proportions. 
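
A titration of this kind can be sketched as follows. The exact interval used for each trial is not recoverable from the text here, so the lower bound `n / n_trials` below is purely an assumption chosen so that early trials have widely separated proportions and the final trial has equal ones; the function name is likewise hypothetical.

```python
def titration_proportions(n_celltypes, n_trials):
    """Return one list of mixing proportions per trial.

    For trial n, raw values are equally spaced between an assumed lower bound
    (n / n_trials) and 1, then rescaled to sum to one.
    """
    trials = []
    for n in range(1, n_trials + 1):
        low = n / n_trials  # assumed lower bound; reaches 1 on the last trial
        if n_celltypes == 1:
            raw = [1.0]
        else:
            step = (1.0 - low) / (n_celltypes - 1)
            raw = [low + step * k for k in range(n_celltypes)]
        total = sum(raw)
        trials.append([value / total for value in raw])
    return trials
```

Holding one celltype out of the mixture then amounts to dropping its entry and renormalizing the remaining proportions.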

We observed that both ESC cell lines and BJ5TA cell lines caused deconvolution to fail 
(Figure 4A,B). Specifically, inclusion of either cell line results in an overestimation of the 
mixing proportion for those cell types. We carefully examined these two cell lines to identify 

Figure 2. A: Models were tested using standard library implementations on a set of 128 randomly 
generated sets of mixing parameters. Each model was tested on 100 different subsets of ROIs selecting 
10^n points for n ∈ [1,5] using linear spacing between subsequent n. Most models perform well 
in the intermediate region of 10°-10* points selected per-sample, but diverge outside of that regime. 
For each set of ROIs selected, the same 128 randomly generated sets of mixing parameters were 
used as in We observe that for essentially all points, NNLS outperforms more complex 
models. B: To understand the selection process of our subsetting algorithm, we tested whether 
enhancers were selected from the full ROI set at a greater rate than would be expected by random. 
To do so, we performed a hypergeometric test with Bonferroni correction over all trials of our ROI 
subsets. We observe that for smaller system sizes the enhancer/promoter sampling ratio does not differ 
dramatically from that expected by random sampling. When the system size increases, enhancers 
become preferentially selected over promoters (p < 0.05), but this increase in the rate of enhancer 
selection does not correlate with the accuracy of any model. 


distinguishing features relative to the other cell lines. 

To determine whether the number of cell lines or cell line immortalization differences could 
be the source of the problem, we added lymphoblastoid cell lines (LCLs) immortalized by EBV. 
Notably, LCLs do not confound the model and show excellent performance (Figure 4C). Both 

Figure 3. To compare the maximum theoretical accuracy of our system, we conducted the same 
analysis as in Figure 2 using either the region sets of bidirectionals, annotated genes, or a combination 
of the two, performing the same subsetting procedure as before. We observe that at smaller region 
sizes using genes alone provides a higher degree of accuracy than just bidirectionals, but that at 
larger sets of ROIs bidirectionals alone can achieve a higher absolute degree of accuracy. Somewhat 
unexpectedly, the combination of both sets of regions performs more poorly than each separate subset. 
Note that as system size increases, the accuracy of the set using annotated genes reaches a constant 
level purely because the total size of that system is exhausted by virtue of being an order of magnitude 
smaller than that of the bidirectionals or combined set. 


ESC (embryonic stem cells) and BJ5TA (fibroblast derived) are non-terminally differentiated 
and non-oncogenic (Figure 4D). Furthermore, we see that even without regularization, NNLS 
successfully removes non-present celltypes (Figure 4A-C), meaning that undifferentiated celltypes 
will not be inferred in the mixing proportion if they are not present at all in the mixture. 
Regularization techniques are therefore not required to remove celltypes 
that are absent. 

An alternative hypothesis is that heterogeneity in the population of undifferentiated 
celltypes is the source of this problem. However, this would suggest that more 
heterogeneous cell populations should perform worse in deconvolution, as should cells from 
similar tissue types. Yet based on this data, this seems unlikely, given that both CD4+ and 
Jurkat cells, both peripheral blood mononuclear cells (PBMC) derived, are present in the 
mixture and are successfully estimated by our models. Since the addition of a lymphoblast cell 
line immortalized using EBV (GM12878) does not result in system failure in the same way 
that is observed with the non-differentiated cell-lines, we suspect that differentiation is the key 
issue here as opposed to heterogeneity. Our work suggests that undifferentiated or partially 
differentiated cell types pose a key challenge to the deconvolution of nascent sequencing data 

Celltypes with strong potential for differentiation confound deconvolution 
Figure 4. To interrogate the effect of undifferentiated and partially differentiated celltypes on the 
performance of deconvolution, we performed a titration experiment, estimating mixing parameters 
for 100 different mixtures of celltypes as mixing proportions were taken from maximally separated 
to equivalent. For each trial n, the mixing proportions are equally spaced points in n+, 1] that are 
then rescaled to sum to one. Each subset (A,B,C) was generated by holding out one celltype from 
the full mixture and renormalizing the adjusted mixing proportions to sum to one. A,B: Adding 
either BJ5TA or ESC cells into the mixture causes a higher-than-true proportion of those cells to be 
estimated. Neither of these cell lines are terminally differentiated. C: Addition of EBV immortalized 
LCL cells into the mixture does not result in failure of the deconvolution model, suggesting that the 
observed failures are not a function of how cells were immortalized. D: To understand if this failure 
could be attributed to celltype specificity, we calculated the mean Jensen-Shannon Divergence for 
each sample compared to all others. The pluripotent ESC cells show the lowest celltype specificity 
while the partially differentiated BJ5TA cells show the highest celltype specificity, with the exception 
of HeLa cells. 

when using enhancers because their regulatory profile, particularly that of their enhancer 
regions, resembles an ensemble profile of multiple differentiated celltypes. In support of this, 
the problem does not seem to occur when using genes alone, suggesting that undifferentiated 
cells may lack the same level of specificity at bidirectionals as terminally differentiated cell 
types. 

Our results suggest either very low or very high celltype specificity when looking at these 
samples’ bidirectional ROIs (Figure 4D). When looking at the mean Jensen-Shannon Divergence 
for each sample compared to all others, we observe that our undifferentiated cell lines 
are either the least specific (ESC) or the most specific (BJ5TA). Although HeLa cells show 
the highest degree of celltype specificity by this measure, HeLa cells are not representative 
of human cells, exhibiting notably different expression patterns*! which would lead to a high 
degree of cell type specificity. Past work has shown that ESC cell lines have genome-wide 
transcriptional hyperactivity’? that narrows as differentiation progresses. Additionally, work 
in hematopoietic cells has suggested that these undifferentiated cell lines are characterized by 
a high degree of fluidity in chromatin modification.*? More work is required to definitively 
establish that differentiation is the source of the breakdown of deconvolution in this system, 
and will likely require significant work outside the scope of this preliminary study. 
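
The specificity measure discussed above can be sketched as a mean pairwise Jensen-Shannon divergence. The example assumes that each sample's ROI counts are first normalized into a probability distribution over ROIs, which is our reading rather than a detail stated in the text; base-2 logarithms are used so that the divergence lies in [0, 1]. Function names are hypothetical.

```python
from math import log2

def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x == 0 contribute nothing to the Kullback-Leibler sum.
        return sum(a * log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def mean_jsd_per_sample(profiles):
    """profiles: {sample name: list of counts over the same ROIs}.

    Returns the mean divergence of each sample from all other samples.
    """
    dists = {}
    for name, counts in profiles.items():
        total = sum(counts)
        dists[name] = [c / total for c in counts]  # assumed normalization
    result = {}
    for name, p in dists.items():
        others = [jensen_shannon_divergence(p, q)
                  for other, q in dists.items() if other != name]
        result[name] = sum(others) / len(others)
    return result
```

Under this measure a sample whose profile resembles an average of the others receives a low mean divergence, while a sample with many private ROIs receives a high one.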


3. Conclusion 


This work is the first to examine supervised deconvolution of heterogeneous mixtures of nascent 
sequencing data. Deconvolution is an essential tool for the study of heterogeneous samples, 
whether cell lines or tissues. While most work on deconvolution of heterogeneous samples has 
moved on to focusing on single cell protocols, a single cell nascent sequencing protocol currently 
seems infeasible. Thus, nascent sequencing is limited to bulk experiments, which appear to be 
reliably separable by supervised deconvolution. We present here the use of nascent sequencing 
data as a testbed for this supervised deconvolution problem. We integrate best practices from 
the literature and develop new techniques to handle characteristics in nascent sequencing data 
where assumptions from the RNA-seq deconvolution literature do not hold. 

To benchmark various deconvolution algorithms, we first developed a new algorithm to 
filter ROIs to only use regions with the most celltype specific expression. We find that this 
selection process does not preferentially select enhancer or promoter ROIs. That said, the 
number of enhancer associated bidirectionals far exceeds annotated genes, providing ample 
features from which to select regions of interest. Our proposed algorithm is simple, fast, and 
reliable, and establishes a strong first basis for the development of more specific ROI filtering 
tools for nascent deconvolution. 

Using this algorithm, we compared standard methods used for solving the deconvolution 
problem. Specifically, we tested NNLS, Ridge, LASSO, as well as ε-SVR. We found that all 
methods reliably separate the nascent deconvolution system, with L2-regularized methods 
performing comparatively poorly relative to NNLS. Furthermore, we found that even a 
simple method like NNLS could reliably eliminate celltypes that were not present in the sample, 
suggesting regularization is not necessary for solving the deconvolution problem here. While 
we find that both annotated genes and bidirectionals can achieve high accuracy in supervised 
deconvolution (with bidirectionals having an edge in absolute accuracy), it is worth emphasizing 
that bidirectionals are distinctly advantageous in that they are annotation-independent and 
discovered de novo for each sample. 

We show that the addition of undifferentiated samples to a nascent deconvolution system 
results in highly skewed mixing estimates, with undifferentiated celltypes predicted as far more 
likely than their actual frequency in the mixture. One possible reason for this is that undifferentiated 
celltypes tend to show regulatory patterns akin to a combination of the regulatory 
patterns of each constituent celltype. The system appears to fail only when some amount 
of the undifferentiated celltype is actually present in the mixture. 

One key issue in this work is the lack of availability of diverse high quality nascent sequencing 
data to perform simulations against. Although a large amount of nascent sequencing data 
is available and published, the number of cell types available is somewhat limited. Protocols 
aimed at extending run-on sequencing to a broader base of samples, such as ChRO-seq** show 
promise in alleviating this bottleneck. Importantly, many of the earliest nascent data sets 
lacked replicates — which excluded their usage here. Data quality and availability is often a 
limiting factor in computational studies, and this work is not an exception to that rule. 

In this work, and generally for the supervised deconvolution problem, we assume that all 
cells in a sample are taken from an approximately homogeneous population. This is sometimes 
a reasonable assumption but is often not. One future frontier that could be highly beneficial 
to this project is the incorporation of single cell ATAC-seq (scATAC) as a secondary source 
of information to augment bulk nascent sequencing data. scATAC combines the chromatin 
accessibility readout provided by ATAC-seq (indicative of regions open to transcription) with 
the cell-specific information provided by modern single cell sequencing protocols. Tools are 
already well defined for clustering single cell sequencing data into constituent cell types, as 
individual cells can typically be separated using dimensionality reduction methods like PCA, 
tSNE, or UMAP.*°-4° Because transcription occurs in regions of open chromatin, which is 
what ATAC-seq measures, mixing fractions and celltype specific transcripts could be estimated 
more reliably using combined data from both protocols. Future work pairing single 
cell ATAC-seq data and nascent sequencing data could leverage techniques used by existing 
tools?! to do deconvolution on a more granular level for individual samples, providing a strong 
complementary tool to the bulk deconvolution discussed here. While single cell approaches 
remain comparatively expensive, this combination would be a powerful tool for looking at 
transcriptional regulatory networks at the level of sub populations of samples. 

Nascent sequencing is a powerful tool for the assessment of transcriptional regulatory 
networks, and when paired with deconvolution tools will also facilitate deeper understanding 
of those regulatory networks in heterogeneous cell populations. Leveraging a transcription 
oriented sequencing approach instead of an expression oriented (i.e., steady state) one provides 
myriad benefits — more thorough coverage of the genome, understanding of regulatory elements, 
and a deep view of underlying transcriptional dynamics — all of which can be integrated with 
different sequencing protocols to great effect. Supervised deconvolution represents an important 
preliminary foothold into this space, and this work shows that nascent sequencing data is well 
suited for that class of problems. 


Splitpea: quantifying protein interaction network rewiring changes due to 
alternative splicing in cancer 

Protein-protein interactions play an essential role in nearly all biological processes, and it 
has become increasingly clear that in order to better understand the fundamental processes 
that underlie disease, we must develop a strong understanding of both their context specificity 
(e.g., tissue-specificity) as well as their dynamic nature (e.g., how they respond to 
environmental changes). While network-based approaches have found much initial success 
in the application of protein-protein interactions (PPIs) towards systems-level explorations 
of biology, they often overlook the fact that large numbers of proteins undergo alternative 
splicing. Alternative splicing has not only been shown to diversify protein function through 
the generation of multiple protein isoforms, but also to remodel PPIs and affect a wide range 
of diseases, including cancer. Isoform-specific interactions are not well characterized, so we 
develop a computational approach that uses domain-domain interactions in concert with 
differential exon usage data from The Cancer Genome Atlas (TCGA) and the Genotype-Tissue 
Expression project (GTEx). Using this approach, we can characterize PPIs likely 
disrupted or possibly even increased due to splicing events for individual TCGA cancer 
patient samples relative to a matched GTEx normal tissue background. 


Keywords: alternative splicing; protein-protein interaction networks; protein network 
rewiring 


1. Introduction 


Alternative splicing is a crucial mechanism that underlies the increased complexity of higher 
eukaryotes. It is now estimated that ~95% of human genes1,2 undergo splicing changes, and the 
increase in protein diversity that results from splicing has been put forth as one of the primary 
explanations for the apparent mismatch between species complexity and their genome size.3,4 
Importantly, alternative isoforms of the same gene can exhibit highly different interaction 
profiles and thus affect the dynamics of protein interaction networks.5 Splicing has been shown 
to be a key regulator of tissue specificity (especially in the brain),6 and dysregulation has 
been increasingly implicated in a wide array of diseases,7 from cancer8,9 to neurodegenerative 
diseases.10 Thus, it is critical to understand the changes in protein interactions due to splicing 
that underlie cellular function and dysfunction. 

However, a systematic study of splicing-related protein network dynamics is hampered 
by multiple challenges. Although emergent experimental approaches to directly screen for 
isoform-level protein-protein interactions are promising,® they are very early in development 
and highly restricted in resolution. Furthermore, all such screens are naturally bounded by 
not only a combination of technical and cost constraints, but also the inherent complexity of 
the underlying networks and the vast number of potential cell types and conditions of interest. 
Fortunately, the now standard use of RNA-sequencing provides a window into the exploration 
of splicing patterns across varied conditions. While RNA-seq data alone is still insufficient 
to chart out the entirety of any particular splicing interaction network, it can be used to 
understand condition-specific splicing dynamics. 

Here, we present Splitpea (SPLicing InTeractions PErsonAlized), a method for detecting 
sample-specific PPI network rewiring events. Splitpea takes advantage of the key insight that 
splicing can disrupt critical protein domains that mediate PPIs through domain-domain interactions 
(DDIs), which have been derived based on a mix of structural, evolutionary, and 
computational approaches.11-14 Splitpea integrates PPI and DDI information with sample-specific 
differential splicing events, and can be used easily in concert with existing, established 
computational approaches for the identification and quantification of differential splicing.15 In 
the scenario where only an individual sample is available or a different background context is 
preferable (versus existing control samples), Splitpea provides functionality to use a separate 
reference database of background splice events; for example, one can choose to use normal 
GTEx data as background for individual TCGA cancer samples (matched by tissue type). 
Furthermore, as part of Splitpea’s characterization of the potential downstream interaction 
network changes, Splitpea indicates likely direction: gain, loss, or chaos (mixed / unclear). 

Thus, to our knowledge, Splitpea is the first general tool to characterize potential direction 
of protein interaction rewiring due to splicing for individual samples. We demonstrate the utility 
of Splitpea on breast and pancreatic cancer samples from TCGA, using matched normal 
tissue samples (breast and pancreas) from GTEx. All source code for Splitpea and the corresponding 
analyses are available via GitHub (https://github.com/ylaboratory/splitpea), 
with additional links to download all data and associated networks. 


1.1. Prior work 


Prior work considering domain-domain interactions in the context of splicing has mostly 
focused on query-based or visualization interfaces. Many consider interactions at the isoform 
level, aiming to provide a context-specific isoform interaction graph.16-18 There has been relatively 
less work focusing on characterizing network rewiring events. Recently, the first tool 
to characterize the mechanistic effects of splicing on downstream PPIs was proposed,19 but 
this tool is unable to differentiate between the potential directionality of interaction rewiring 
(likely gain or loss events). Specifically for the study of cancer, there have also been large-scale 
analysis efforts to characterize the impact of splicing on PPIs across patients. Though this 
work was not patient-specific, it provided strong evidence that there exists a 
large catalog of isoform changes (with potential downstream impacts on PPIs and regulatory 
networks) that exist independently of expression changes in cancer. Beyond using PPI networks, 
there have also been exciting efforts integrating cancer RNA-seq together with somatic 
mutation data and using functional networks to interpret the downstream impact of splicing.20 


2. Methods 


2.1. Protein interaction and domain interaction data 


Human protein-protein interactions were downloaded from BioGRID (v4.4.207),21 DIP (2017-02-05),22 
HIPPIE (v2.2),23 HPRD (Release 9),24 Human Interactome (HI-II),25 IntAct (2022-04-18),26 
iRefIndex (v18.0),27 and MIPS (Nov 2014).28 All proteins were mapped to Entrez 
Gene IDs.29 

Known and predicted domain-domain interactions were downloaded from 3did 
(v2017_06),11 DOMINE (v2.0),12 IDDI (2011.05.16),13 and iPFAM (v1.0).14 For predicted 
DDIs, only interactions with confidence > 0.5 were used in downstream analyses. 

Protein domain locations were translated to genomic locations using the Ensembl BioMart 
API and the biomaRt R package30 and indexed using tabix31 to facilitate fast retrieval given 
a set of genomic coordinates. 


2.2. Tissue and tumor splicing data processing 


Spliced exon values in the form of percent spliced in (PSI or ψ) were obtained for both normal 
pancreas and breast tissue samples from the Genotype-Tissue Expression (GTEx) project and 
pancreatic cancer and breast cancer samples from The Cancer Genome Atlas (TCGA) using 
the IRIS database.32 IRIS uses rMATS33 to tabulate ψ values for skipped exon events (the 
most abundant splicing event). Though we use rMATS ψ values in this study, Splitpea is 
agnostic to the choice of upstream differential splicing analysis tool and can easily be applied 
in concert with other tools that use a form of ψ as their quantification metric.34-37 

Specifically, we delineate ψ_i as the ψ value for exon i = 1, ..., n_E, where there are n_E total 
exons that had a reported exon skipping event. Note that the precise exons captured in the 
sample of interest and the background samples are typically non-identical. We are only able to 
estimate Δψ for exons that are captured in both, and thus, n_E represents the number of exons 
that lie at the intersection of the two larger sets of exons. In the scenario where a background 
reference distribution of ψ values is provided, we calculate Δψ_i as the following: 


$$\Delta\psi_i^{(s)} = \psi_i^{(s)} - \frac{1}{n_B} \sum_{b=1}^{n_B} \psi_i^{(b)} \qquad (1)$$


where ψ_i^(s) is the ψ for exon i in our sample of interest s (e.g., a cancerous pancreatic sample 
from TCGA), while ψ_i^(b) is the ψ for the same exon i in an individual background sample b 
(e.g., a normal pancreatic sample from GTEx), and n_B is the total number of background samples. 
Intuitively, larger n_B will provide better estimates of the background distribution, especially 
if there is large variability in splicing patterns. We recommend assembling backgrounds with 
n_B > 30 for the empirical cumulative density function estimate below. 
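
As a small worked illustration of Eq. 1, the sketch below computes Δψ for each exon of a single sample against the mean ψ of a set of background samples. The dictionary-based data layout, the function name `delta_psi`, and the rule of skipping exons that have no background values are assumptions made for the example rather than details of Splitpea's code.

```python
def delta_psi(sample_psi, background_psi):
    """Compute delta psi per exon as sample psi minus mean background psi.

    sample_psi: {exon_id: psi} for the sample of interest.
    background_psi: list of {exon_id: psi} dicts, one per background sample.
    Exons never observed in the background are skipped (assumed rule).
    """
    result = {}
    for exon, psi_s in sample_psi.items():
        values = [bg[exon] for bg in background_psi if exon in bg]
        if not values:
            continue
        result[exon] = psi_s - sum(values) / len(values)
    return result
```

A positive value indicates that the exon is included more often in the sample than in the background, and a negative value indicates that it is skipped more often.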

ψ values lie in the range [0,1]; thus Δψ ∈ [-1,1], and we are naturally primarily interested 
in significant events for large |Δψ| values (cases where exons are significantly skipped or 
significantly retained relative to reference). To calculate an estimated significance level for 
|Δψ|, we rely on similar intuition as used in previous studies,*?* that the normal reference 
samples can be used to construct an empirical cumulative density function for each exon: 
$$\hat{F}_{n_B}(t_i) = \frac{1}{n_B} \sum_{b=1}^{n_B} \mathbf{1}_{|\Delta\psi_i^{(b)}| \le t_i} \qquad (2)$$


where 1_A is the indicator function for event A. Given this exon-specific F̂_{n_B}(t_i), we can estimate 
an empirical p-value for each exon i in sample s: 

$$\hat{p}_i^{(s)} = \frac{1}{2}\left(1 - \hat{F}_{n_B}\left(|\Delta\psi_i^{(s)}|\right)\right) \qquad (3)$$

Finally, as input to Splitpea, we filtered exons to only those that are significantly different 
from background (p̂ < 0.05) and have a Δψ change bigger than 0.05 (|Δψ| > 0.05), 
defined as Ψ below. We chose to use a p-value cutoff here as opposed to a multiple hypothesis 
corrected value to reduce false negatives, because we are interested in any possible rewiring 
events. We hope that this will better enable Splitpea’s use for hypothesis generation tasks. In 
general, these thresholds can be easily varied depending on the downstream purpose. 
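
The significance filter can be sketched as follows. Several details are assumptions on our part: the background Δψ values are taken as each background sample's ψ minus the background mean, the empirical distribution is built from their absolute values, and the factor of one half reflects our reading of Eq. 3. The function names and the dictionary layout are hypothetical.

```python
def empirical_p_value(delta_s, background_deltas):
    """Empirical p-value of |delta_s| against the background |delta psi| values."""
    n_b = len(background_deltas)
    # Fraction of background samples whose |delta psi| is at most |delta_s|.
    ecdf = sum(1 for d in background_deltas if abs(d) <= abs(delta_s)) / n_b
    return 0.5 * (1.0 - ecdf)  # assumed form of Eq. 3

def significant_exons(sample_psi, background_psi, p_cutoff=0.05, delta_cutoff=0.05):
    """Keep exons passing both the p-value and the |delta psi| thresholds.

    sample_psi: {exon_id: psi}
    background_psi: {exon_id: list of psi values across background samples}
    Returns {exon_id: (delta psi, empirical p-value)}.
    """
    kept = {}
    for exon, psi_s in sample_psi.items():
        bg = background_psi.get(exon)
        if not bg:
            continue  # exon not captured in the background
        mean_bg = sum(bg) / len(bg)
        delta_s = psi_s - mean_bg
        bg_deltas = [value - mean_bg for value in bg]
        p = empirical_p_value(delta_s, bg_deltas)
        if p < p_cutoff and abs(delta_s) > delta_cutoff:
            kept[exon] = (delta_s, p)
    return kept
```

Both cutoffs are exposed as parameters because, as noted above, the thresholds can be varied depending on the downstream purpose.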


2.3. Clustering Δψ values 


For each cancer type, we removed any exons that had missing values in any of the samples, then 
filtered the exons by variance, keeping only those with variance greater than 0.01. The final set 
of Δψ for each cancer type was clustered using the complete-linkage hierarchical clustering algorithm 
and plotted with the heatmap.2 function in the gplots R package.’ Clinical annotations for 
TCGA samples were obtained from the Genomic Data Commons portal with PAM50 calls 
from Netanely et al.*° 
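
The exon filtering that precedes clustering can be sketched with the standard library alone. The example assumes that missing values are represented as `None` and that the variance in question is the sample variance; neither detail is stated in the text, and the function name is hypothetical.

```python
from statistics import variance

def filter_exons_for_clustering(delta_psi_by_exon, min_variance=0.01):
    """Drop exons with missing values, then keep those with enough variance.

    delta_psi_by_exon: {exon_id: list of delta psi values, one per sample,
                        with None marking a missing value}.
    """
    kept = {}
    for exon, values in delta_psi_by_exon.items():
        if any(v is None for v in values):
            continue  # exon is missing in at least one sample
        if len(values) > 1 and variance(values) > min_variance:
            kept[exon] = values
    return kept
```

The retained matrix could then be passed to any complete-linkage hierarchical clustering routine, for example `scipy.cluster.hierarchy.linkage` with `method="complete"`.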


2.4. Network rewiring algorithm 


There is inherent complexity in considering the impact of exon changes on protein domains, 
and finally, proteins, as there are several many-to-many relationships. A single exon can include 
multiple protein domains, but a single protein domain can also span multiple exons; proteins 
can thus consist of multiple exons as well as multiple protein domains. Splitpea homes in on 
potentially domain-mediated protein interactions by first overlaying DDIs on the aggregated 
PPI network based on the presence of each of the domains that constitute the pair of interactors 
in the protein. In other words, for a pair of proteins g_1 and g_2, we consider protein domain d_1 
in g_1 and domain d_2 in g_2 as potentially mediating a known PPI between g_1 and g_2 if a DDI 
has been reported between d_1 and d_2. Fig. 1A depicts an example interaction where several 
DDIs potentially mediate the same PPI. 
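
The overlay of DDIs on a PPI described above can be sketched as a lookup over domain pairs. This is a minimal illustration under assumed data structures (a protein-to-domains dictionary and a set of unordered domain pairs); the protein and domain names are hypothetical.

```python
def mediating_ddis(ppi, protein_domains, ddis):
    """Return domain pairs (d1 in g1, d2 in g2) with a reported DDI.

    ppi: tuple (g1, g2) of interacting proteins.
    protein_domains: dict mapping protein -> set of domain names.
    ddis: set of frozensets, each an unordered pair of interacting domains.
    """
    g1, g2 = ppi
    pairs = []
    for d1 in sorted(protein_domains.get(g1, ())):
        for d2 in sorted(protein_domains.get(g2, ())):
            if frozenset((d1, d2)) in ddis:
                pairs.append((d1, d2))
    return pairs


# Hypothetical toy annotation.
protein_domains = {
    "G1": {"SH2", "SH3"},
    "G2": {"Kinase", "PRM"},
    "G3": {"PDZ"},
}
ddis = {frozenset(("SH2", "Kinase")), frozenset(("SH3", "PRM"))}

g1_g2 = mediating_ddis(("G1", "G2"), protein_domains, ddis)
g1_g3 = mediating_ddis(("G1", "G3"), protein_domains, ddis)
```

Here the G1-G2 interaction is potentially mediated by two DDIs, as in the Fig. 1A scenario, while G1-G3 has no known mediating domain pair.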

Fig. 1. Overview of Splitpea. Splitpea combines prior knowledge in the form of protein-protein and
domain-domain interactions with splicing changes to provide a view of a rewired network for a given
experimental context. Splitpea defines a rewiring event when exon changes affect an underlying
domain-domain interaction. Toy scenarios that would result in the three possible rewiring events
predicted by Splitpea are illustrated in B.

In the event that there are multiple exons within the same protein domain, we attribute
the minimum Δψ value to the entire protein domain. The underlying assumption here is
that loss of any portion of a particular protein domain may potentially negatively impact
the protein domain’s downstream capacity to interact with other domains. Splitpea then
determines the directionality of change based on whether or not there is consistency across
the changing domains. In the event that there are mixed exon changes, the directionality
is labeled as “chaos,” or undetermined (Fig. 1B). The weight of the edge is calculated as
the mean of the domain-level Δψ values. Essentially, the following pseudocode describes the crux of
Splitpea’s algorithm for a given sample with a set of exons with associated Δψ values:


for each PPI between g_u, g_v do
    Ψ^(u) := significant exons for gene u
    Ψ^(v) := significant exons for gene v
    D^(u) := {d_u | d_u ∈ g_u, ∃ exon_i s.t. exon_i ∈ Ψ^(u) & exon_i ∈ d_u}
    D^(v) := {d_v | d_v ∈ g_v, ∃ exon_j s.t. exon_j ∈ Ψ^(v) & exon_j ∈ d_v}
    w_uv := network rewiring edge weight between g_u, g_v
    δ_uv := direction classification of network rewiring between g_u, g_v
    for each DDI between d_u ∈ D^(u), d_v ∈ D^(v) do
        Δψ_{d_u} := min({Δψ_i | exon_i s.t. exon_i ∈ Ψ^(u) & exon_i ∈ d_u})
        Δψ_{d_v} := min({Δψ_j | exon_j s.t. exon_j ∈ Ψ^(v) & exon_j ∈ d_v})

    if ∀ d_u, d_v ∈ DDI(d_u, d_v), Δψ_{d_u} > 0, Δψ_{d_v} > 0 then
        δ_uv = positive
    else if ∀ d_u, d_v ∈ DDI(d_u, d_v), Δψ_{d_u} < 0, Δψ_{d_v} < 0 then
        δ_uv = negative
    else
        δ_uv = chaos

    w_uv = (Σ_{d_u ∈ D^(u)} Δψ_{d_u} + Σ_{d_v ∈ D^(v)} Δψ_{d_v}) / (|D^(u)| + |D^(v)|)

    return w_uv, δ_uv

w_uv and δ_uv are reported as long as Ψ^(u) or Ψ^(v) is non-empty. Please note that the w_uv
calculation only includes domains that have a DDI that is considered to be mediating the PPI
between g_u, g_v. For readability, the equation above omits the removal of non-DDI pairs.
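
For readers who prefer executable code, the per-edge logic can be sketched in Python as below. This is an interpretation of the description above, not Splitpea's actual implementation: the function names, the dictionary-based inputs, and the choice to return None when no mediating DDI is affected are all assumptions made for the example.

```python
from statistics import mean


def domain_dpsi(domain_exons, sig_dpsi):
    """Assign each domain the minimum delta-PSI among its significant exons."""
    out = {}
    for domain, exons in domain_exons.items():
        values = [sig_dpsi[e] for e in exons if e in sig_dpsi]
        if values:  # domains without a significant exon are left out
            out[domain] = min(values)
    return out


def rewire_edge(dpsi_u, dpsi_v, domains_u, domains_v, ddis):
    """Classify one PPI; returns (weight, direction), or None if unaffected."""
    affected = {}
    for du in domains_u:
        for dv in domains_v:
            if frozenset((du, dv)) not in ddis:
                continue  # only DDIs thought to mediate this PPI count
            if du in dpsi_u:
                affected[("u", du)] = dpsi_u[du]
            if dv in dpsi_v:
                affected[("v", dv)] = dpsi_v[dv]
    if not affected:
        return None
    values = list(affected.values())
    if all(x > 0 for x in values):
        direction = "positive"
    elif all(x < 0 for x in values):
        direction = "negative"
    else:
        direction = "chaos"
    return mean(values), direction


# Hypothetical toy input: exon -> significant delta-PSI, domain -> exons.
sig = {"e1": -0.30, "e2": -0.10, "e3": 0.20}
exons_u = {"SH2": ["e1", "e2"], "SH3": ["e9"]}
exons_v = {"Kinase": ["e3"]}
ddis = {frozenset(("SH2", "Kinase"))}

dpsi_u = domain_dpsi(exons_u, sig)
dpsi_v = domain_dpsi(exons_v, sig)
edge = rewire_edge(dpsi_u, dpsi_v, exons_u, exons_v, ddis)
```

In this toy case the SH2 domain is assigned its most negative exon change while the partner's kinase domain increases, so the directions disagree and the edge is labeled "chaos".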


2.5. Consensus network 


The main factor to consider when aggregating several sample-specific Splitpea networks into
a consensus network is whether the directionality of edges agrees. Thus, a “positive” consensus
network and a “negative” consensus network are built separately. “Chaos” edges are ignored
since their state is ambiguous. For each consensus network, two factors are considered for
the edge weight: the sum of the original edge weights w_uv and how many networks support
the same directionality δ_uv. The downstream analysis with each consensus network focuses
on the largest connected component. As is common in biological networks, we found that 
the largest connected component covers the majority of the edges of the complete consensus 
network (breast cancer: 96.4% edges retained in negative consensus, 89.5% edges retained 
in positive consensus; pancreatic cancer: 96.1% edges retained in negative consensus, 88.8% 
edges retained in positive consensus). 
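
A minimal sketch of this aggregation is given below, assuming each sample network is a dictionary from a gene pair to a (weight, direction) tuple. The function name and the optional `min_fraction` support filter (in the spirit of the 80% threshold used in Section 3.4) are assumptions made for illustration.

```python
from collections import defaultdict


def consensus_network(sample_networks, direction, min_fraction=0.0):
    """Aggregate edges of one direction across sample-specific networks.

    Returns a dict mapping edge -> (summed weight, number of supporting
    networks). Edges of any other direction, including chaos, are ignored.
    """
    total_weight = defaultdict(float)
    support = defaultdict(int)
    for net in sample_networks:
        for edge, (weight, edge_direction) in net.items():
            if edge_direction == direction:
                key = tuple(sorted(edge))  # treat edges as undirected
                total_weight[key] += weight
                support[key] += 1
    n = len(sample_networks)
    return {
        e: (total_weight[e], support[e])
        for e in support
        if support[e] / n >= min_fraction
    }


# Hypothetical toy networks for three samples.
samples = [
    {("A", "B"): (-0.4, "negative"), ("B", "C"): (0.2, "positive")},
    {("B", "A"): (-0.2, "negative"), ("B", "C"): (-0.1, "negative")},
    {("A", "B"): (-0.3, "negative"), ("B", "C"): (0.05, "chaos")},
]
negative = consensus_network(samples, "negative", min_fraction=0.8)
positive = consensus_network(samples, "positive")
```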


2.6. Network embedding and clustering 


To enable network clustering and other downstream uses of the Splitpea patient-specific
networks, we created whole-graph-level embeddings. Here, we chose to focus only on potential
gain-of-interaction edges and first filtered each patient-specific network accordingly. Taking
the largest connected component, we applied the FEATHER“ algorithm from the KarateClub
NetworkX extension library*? to generate an embedding for each network.

We clustered the resulting embeddings for each cancer type using hierarchical density-based
clustering (HDBSCAN)*? with a minimum cluster size of 10. Clustering results were generally
robust to the choice of the minimum cluster size parameter; 10 was chosen for downstream 
interpretability (and we would consider samples with fewer neighbors as outliers). Final plots 
were produced using principal component analysis (PCA), plotting all embeddings by their 
first two components. 


3. Results 
3.1. Quantifying splicing changes in pancreatic and breast tumors 


In total, we collected data from TCGA covering 177 pancreatic primary tumors and 1,088
breast primary tumors, together with 192 normal pancreatic tissue and 218 normal breast
tissue samples from GTEx that were used as a reference distribution of normal splicing variation
for each respective cancer type. With these data, we calculated a Δψ value corresponding to
the change in exon splicing in each tumor sample relative to its normal tissue background,
resulting in Δψ estimates for a total of 139,661 unique exons across all breast cancer samples
and 98,761 unique exons across the pancreatic cancer samples. Furthermore, we calculated an
accompanying p-value that compares how extreme the observed ψ value for each exon in each
cancer sample is relative to the corresponding background distribution of ψ values for normal
tissue samples (see Methods).


3.2. Δψ values primarily reflect primary diagnoses


We then clustered the Δψ matrices for each tumor type and checked whether they corresponded
to relevant clinical and pathological tumor features for both breast cancer (Fig. 2A;
PAM50 subtypes, diagnosed type, pathologic stage, and age) and pancreatic cancer (Fig. 2B;
site of origin, diagnosed type, pathologic stage, age, and sex). While the majority of clinical
features do not cluster meaningfully with Δψ values, we do observe that the most distinct
patient cluster for pancreatic cancer (far right columns in Fig. 2B) consists entirely of pancreatic
neuroendocrine tumors. Neuroendocrine tumors are a rare subset of pancreatic cancers that originate
not in the cells of the pancreas but in neuroendocrine cells. Interestingly, this cell type has
commonality with neurons, which are known to undergo more splicing changes.*4 For breast
cancer, we see some clustering of lobular carcinomas (red cluster in “type” bar, Fig. 2A), but
otherwise do not see obvious patterns of clinical or pathological separation with Δψ values
alone.


[Figure 2: heatmaps of delta PSI (exons by samples) for (A) breast invasive carcinoma samples and (B) pancreatic adenocarcinoma samples, with annotation bars including site, stage, age, and sex.]


Fig. 2. Clustering on Δψ values. We cluster the Δψ values, showing different sample groups for
different spliced exons. Heatmaps depict splicing changes relative to average normal tissue
background. Bar columns show known clinical information about each sample. In general, there are more
subgroup-level exon changes for breast cancer (A), but these are not strongly correlated with any
clinical variable. In pancreatic cancer, a small subset of neuroendocrine samples (B, dark blue) share
similar splicing patterns. All other samples do not have obvious meaningful structure.


3.3. Quantifying rewired protein-protein interactions for pancreatic and breast tumors


We applied Splitpea to build patient-specific rewired PPI networks for 177 pancreatic and 
1,088 breast primary tumor samples. Each PPI network contains three types of edges (gain, 
loss, or chaos (mixed)) based on how underlying splicing changes may affect the individual
protein-protein interaction (Fig. 3). In general, most splicing changes cause potential loss of
protein interactions, though breast cancer had proportionally fewer lost edges
on average (76% of edges) than pancreatic cancer (84% of edges). Chaos (mixed) edges, where
domain interactions have inconsistent directions per protein, are relatively uncommon and
comprise on average less than 2% of total edges for pancreatic and breast cancer. Between 
the two cancer types, breast cancer has more potential gain-of-interaction edges and a lower 
proportion of potential lost edges relative to pancreatic cancer. Interestingly, there is also more 
variability across edge types per sample in breast cancer samples. 


[Figure 3: proportion of edges (y-axis, 0.00 to 1.00) by edge type (positive, negative, chaos), shown separately for breast cancer and pancreatic cancer.]


Fig. 3. Proportion of relative gain and loss in edges across breast cancer and pancreatic cancer 
samples. Breast cancer samples have proportionally more “gain of interactions” than pancreatic 
cancer samples, but in both cancer types, interaction loss is much more prevalent. For each TCGA 
cancer sample, the proportion of edges gained versus lost is calculated using the total number of edges 
in the largest connected component of the entire Splitpea rewired network (both directions) as the 
denominator. To be conservative, the number of edges retained in the largest connected components 
for the gain-only subnetwork and loss-only subnetworks are used as numerators. 


Looking at individual patient networks (Fig. 4), we can see potential hubs and protein
clusters that undergo extensive remodeling. In Fig. 4A, we show an example of one pancreatic
tumor network with the most remodeling changes in the oncogene RAB35, the proto-oncogenes
HRAS and FYN, the signaling protein MAPK3, and the cell cycle and growth genes NEDD8 and
PRKAA1, among others. Breast cancer patient-specific networks have a different topology
(Fig. 4C), though there is also overlap of proto-oncogenes HRAS and FYN.




[Figure 4: (A) pancreatic patient network; (B) volcano plot with x-axis delta.psi (-1.0 to 1.0); (C) breast cancer patient network.]


Fig. 4. Patient-specific rewired networks. Here, we show two sample network outputs from Splitpea
and the accompanying exon value cutoff. The large network (A) depicts a pancreatic patient sample
(TCGA-HZ-7918-01A-11R-2156-07), with edge losses in red and gains in blue. The corresponding
volcano plot is shown in (B), where exons with significant Δψ (p < 0.05) as well as absolute change
(|Δψ| > 0.05) are shown in red. Box (C) shows a patient-specific network for an example breast
cancer sample, TCGA-BH-A0BG-01A-11R-A115-07, which exhibits a very different topology from
the pancreatic sample in A.


3.4. A consensus network of changes across breast cancer patients 


While patient-specific networks highlight network rewiring at the level of individual tumor
samples, we also sought to look for more general cancer-level patterns of PPI rewiring. Towards
this end, we assembled a consensus rewiring network for breast cancer by taking splicing
rewiring events conserved across 80% of patient samples and assembling a meta-network of
these events. Edges were only preserved when their type (gain, loss) was consistent. Chaos
edges were not included in the consensus network. Naturally, as the threshold increases, the
number of genes preserved in the network decreases (Fig. 5A). Interestingly, up through the
80% threshold, gained edges are relatively more consistently preserved (Fig. 5B). Visualizing
the breast cancer consensus network (Fig. 5C) revealed that the most gained interaction
involved the gene FKBP5, which is an immune regulator responsible for protein trafficking
and folding. This protein has been studied in breast cancer for its various hormone receptor
signaling functions.*°

Fig. 5. Meta-network of breast cancer patients. The line graphs show the number of nodes preserved
for different consensus thresholds (A) or the proportion of nodes relative to the non-thresholded
consensus network (B) for edge loss (negative, red) and edge gain (positive, blue) events. The dashed
line in both graphs denotes a threshold of 80%, corresponding to the visualization of the consensus
network of splicing rewiring events conserved across 80% of breast cancer patient samples (C, red:
edge loss; blue: edge gain).


3.5. Network clusters reveal novel patient subgroups 


The patient-specific networks generated by Splitpea have many downstream applications, 
especially when the networks are used as features for other machine learning tasks. Here, 
we demonstrate their utility by finding patient subgroups across both breast and pancreatic 
cancer when the networks are clustered (Fig. 6). Specifically, we use a state-of-the-art graph 
embedding method, FEATHER,*! which calculates characteristic functions using different 
random walk weights for node features, but any graph embedding method could be used 
for this type of analysis. For each cancer type, we clustered the network embeddings using 
HDBSCAN (see Methods). Interestingly, three distinct groups emerged across the cancer 
types (Fig. 6A). The dominant source of variation across the networks is the gain or loss of 
PPIs involving KRAS (Fig. 6B). Mutations in KRAS are known to affect subgroups of both 
pancreatic and breast cancer*® with ties to prognosis. It is possible that splicing changes in 
interacting partner genes also induce changes to KRAS that may have as yet unknown interaction
effects with these somatic mutations, highlighting the potential of Splitpea to find additional 
disease subtypes. Furthermore, other interesting cancer drivers have distinct patterns of gains 
and losses, including RAB5A, which appears to have PPI gains in the BRCA outliers, and 
IKBKB, which is enriched for gains in the predominantly pancreatic cancer cluster 3. 

Fig. 6. Splitpea networks cluster into distinct subgroups. (A) PCA plots of graph embeddings of
each patient-specific Splitpea network, with samples colored by either cancer type (left) or cluster 
(right). Clusters were assigned using HDBSCAN, with outliers colored in grey. (B) For each cluster, 
the top nodes undergoing the most changes (mean interactions gained or lost) were also identified. 
The bar graphs are roughly separated by genes that have the most gain of interactions (left) versus 
those that have primarily losses (right). Interestingly, the main variation captured in PC1 seems 
to be defined by networks that change in KRAS. Other cancer driver genes also undergo distinct 
patterns of gains and losses that drive clustering patterns. 


4. Discussion and conclusion


We present a new method, Splitpea, for characterizing protein-protein network rewiring events. 
Splitpea is flexible and can be applied with different background contexts to highlight splicing 
changes between a disease and relevant background context of interest. We applied Splitpea 
to breast and pancreatic cancer samples to highlight the potential of Splitpea to find new and 
relevant cancer biology, both on an individual patient sample level and more broadly across 
samples of a single tumor type. To our knowledge, Splitpea is the first systematic method for
identifying both potential gains and losses of PPIs for individual experimental samples.

Splitpea makes heavy use of existing knowledge of protein-protein interactions. Because 
of this, our method is inherently limited by the availability of known PPIs (which are largely 
incomplete), as well as DDIs, which are even less complete. As more of these are experimentally 
characterized, Splitpea will continue to improve, capturing more accurate and comprehensive 
sets of network rewiring events. Since we wrote Splitpea to be modular, updates to PPIs and 
DDIs can be easily integrated once they become available. In particular, study bias is a
well-reported issue in PPIs, and thus there is a large amount of overlap between well-studied nodes
(including many cancer driver genes) and nodes of high degree in PPI networks. Given
the dependency of Splitpea on reported PPIs, this also affects our results. As more systematic
experimental PPI screens and more reliable PPI predictions become available, we can also 
readily adapt Splitpea. 

We have only scratched the surface of cancer biology here. In our initial exploration of 
breast and pancreatic cancer, we have discovered subgroups and outliers within each cancer 
type that can be characterized by different network hubs. We believe this merits more thorough 
exploration, as it may carry important implications for precision medicine efforts. Beyond 
this, it will also be interesting to apply Splitpea to more cancer types and look for pan-cancer 
conservation patterns. 


Access to safe and effective antiretroviral therapy (ART) is a cornerstone in the global response to the
HIV pandemic. Among people living with HIV, there is considerable interindividual variability in 
absolute CD4 T-cell recovery following initiation of virally suppressive ART. The contribution of host 
genetics to this variability is not well understood. We explored the contribution of a polygenic score 
which was derived from large, publicly available summary statistics for absolute lymphocyte count from 
individuals in the general population (PGSlymph) due to a lack of publicly available summary statistics for
CD4 T-cell count. We explored associations with baseline CD4 T-cell count prior to ART initiation 
(n=4959) and change from baseline to week 48 on ART (n=3274) among treatment-naive participants in 
prospective, randomized ART studies of the AIDS Clinical Trials Group. We separately examined an 
African-ancestry-derived and a European-ancestry-derived PGSlymph, and evaluated their performance
across all participants, and also in the African and European ancestral groups separately. Multivariate
models that included PGSlymph, baseline plasma HIV-1 RNA, age, sex, and 15 principal components
(PCs) of genetic similarity explained ~26-27% of variability in baseline CD4 T-cell count, but PGSlymph
accounted for <1% of this variability. Models that also included baseline CD4 T-cell count explained
~7-9% of variability in CD4 T-cell count increase on ART, but PGSlymph accounted for <1% of this
variability. In univariate analyses, PGSlymph was not significantly associated with baseline or change in
CD4 T-cell count. Among individuals of African ancestry, the African PGSlymph term in the multivariate
model was significantly associated with change in CD4 T-cell count, although it was not significant in the
univariate model. When applied to lymphocyte count in a general medical biobank population (Penn
Medicine BioBank), PGSlymph explained ~6-10% of variability in multivariate models (including age,
sex, and PCs) but only ~1% in univariate models. In summary, a lymphocyte count PGS derived from 
the general population was not consistently associated with CD4 T-cell recovery on ART. Nonetheless, 
adjusting for clinical covariates is quite important when estimating such polygenic effects. 


Keywords: HIV; Polygenic Scores; Lymphocyte Count; CD4 T-Cell Count; Pharmacogenomics 


1. Introduction
1.1. Incomplete CD4 T-Cell Recovery in Response to Antiretroviral Therapy 


Human immunodeficiency virus type 1 (HIV-1) is a global health challenge, with 38.4 million 
individuals worldwide living with HIV!, including nearly 1.2 million in the United States”. This virus 
depletes CD4 T lymphocytes (hereafter referred to as CD4 cells), a critical component of the immune 
system’. Effective antiretroviral therapy (ART) controls viral replication, improves health and prevents 
transmission. With viral load reduction, CD4 cell counts may return to normal levels, but in many 
individuals this is not achieved>’. Understanding the etiology of CD4 cell recovery is important 
because individuals with lower CD4 cell counts may be at increased risk for non-AIDS conditions such 
as hepatic cirrhosis, cardiovascular disease, kidney disease, and cancer’. 

The etiology of incomplete CD4 cell recovery has not been fully elucidated, but many biological, 
demographic, treatment, and genetic factors have been associated’. Individuals who begin ART with 
CD4 cell counts <200 cells/mm³ are less likely to achieve normal CD4 cell counts >500 cells/mm³*’.
Other biological factors associated with this treatment response include higher body mass index (BMI), 
lower naive/memory CD4+ cell ratio, lower CD4/CD8 cell ratios, and other immunological factors’.
Demographic factors have also been associated with poor CD4 cell recovery including older age, male 
sex, and Eastern African ancestry, as well as specific ART regimens” !°. Additionally, variants that 
influence the absorption, distribution, metabolism, and elimination of ART may also play a role!!. 
Genes with single nucleotide polymorphisms (SNPs) reported to be associated with CD4 cell recovery 
on ART have included IL-2, IL-2Rβ, IL-2Rγ, IL-15, IL-15Rα, TRAIL, Bim, TNF-α, and IFN-γ'?. One
particular SNP (rs6897932) in IL7RA was associated with a faster CD4 cell count increase in individuals
of both European and African ancestry, but another SNP in this gene (rs3194051) was only associated
with this response in individuals of African ancestry'?4. Another study suggested that differences in
CCR5 genotype and CCL3L1 dosage were associated with the extent and rate of CD4 cell recovery’.
Additionally, HLA-Bw4 homozygosity was associated with impaired CD4 cell recovery!*. Particular 
mitochondrial DNA haplogroups were associated with CD4 cell recovery in individuals of European 
and African ancestry!”'*. More recently, whole exome sequencing associated 41 genes with CD4 cell 
response in females!’. 

Although multiple genes and SNPs have been associated with poor CD4 cell count recovery on 
ART, these explain a small fraction of the variance. Previous studies considered effects of SNPs 
individually, which fails to consider whether combinations of many SNPs may explain a larger portion 
of the variance. Many conditions are polygenic (e.g., coronary artery disease), meaning that many genes 
and variants have impact”. It is conceivable that CD4 cell recovery on ART is also polygenic, so it is 
worth exploring whether polygenic scores may explain a larger portion of the genetic variance, which 
has never been investigated for this treatment response. Furthermore, understanding the 
pharmacogenomic underpinnings of treatment response has the potential to better individualize
therapy7!. 


1.2 Polygenic Scores May Predict Complex Treatment Responses 


One way to assess the contribution of many variants in combination is by applying Polygenic Scores 
(PGS), which are the mathematical, cumulative aggregation of risk derived from the total contribution 
of numerous variants in the genome”. PGS effectively predict phenotypes such as schizophrenia”* 7’, 
bipolar disorder?>8?, breast cancer*?°3, type 2 diabetes??3435, coronary artery disease *°°*°, and atrial 
fibrillation*°**°7. Given their success in other disease areas, it is plausible that PGS could predict poor
CD4 cell recovery in response to ART. 

When using PGS, it is important to consider the potential for ancestral health disparity. Across many 
phenotypes, PGS is more predictive for individuals of European ancestry because this population has 
more readily available summary statistics from large genome-wide association studies (GWAS)**. An 
ultimate goal of PGS is clinical implementation so that patients can be informed of their genetic risk 
for disease**. However, clinical implementation could create a larger health disparity whereby 
individuals of European ancestry may more readily benefit from these risk prediction models**. Thus, 
it is important to improve risk prediction for global populations. This is particularly important for HIV 
given its global distribution of prevalence, particularly in Africa. We hope to better predict genetic risk 
in individuals of African ancestry by generating a PGS based on summary statistics generated in a 
dataset of individuals largely of African ancestry, in addition to a PGS generated in a dataset of 
individuals largely of European ancestry. Additionally, we plan to use PRScsx, a method that more 
effectively predicts polygenic risk in global populations*?. 

In this study, we assess whether the PGS generated from a general population is predictive of CD4 
cell recovery in persons living with HIV (PWH). A similar approach used a body mass index PGS 
generated from a general population to study ART-associated weight gain*®. As there are no large 
GWAS studies of CD4 cell count, either in the general population or in PWH, we generate statistical 
power by using summary statistics on total lymphocyte count from a general population, for which 
large sample sizes are publicly available. Finally, the principle of predicting phenotypic effects in a 
population affected by a health condition by using genetics from the general population was effective 
in one study that found that variants associated with cardiac QRS duration in individuals without cardiac 
diseases were also associated with arrhythmia and atrial fibrillation*!. We assess whether this same 
principle applies to treatment response by testing whether the genetic underpinnings of lymphocyte
count in a general population predict CD4 cell recovery in PWH. We hypothesize that cumulative
genetic variants that affect total lymphocyte count also affect recovery of the CD4 T cell subset in
response to ART (i.e., that a lymphocyte count PGS [PGSlymph] generated from the general population
will be associated with CD4 cell recovery on ART). We also hypothesize that PGSlymph will be
associated with CD4 cell counts prior to initiating ART. 

Figure 1: Study Overview. EUR and AFR PGSlymph were trained using lymphocyte count GWAS summary
statistics. Both PGSlymph were applied to individuals in the AIDS Clinical Trials Group (ACTG) to assess how well
they predict CD4 cell response to ART.


2.1 Data and Study Participants 
2.1.1 Lymphocyte Count Meta Analysis 


We used publicly available summary statistics from a published meta-analysis of existing GWAS for 
lymphocyte count in populations of European and African ancestry in the general population}. The 
meta-analysis included 524,923 individuals of European ancestry with 47,264,266 SNPs, and 13,477 
individuals of African ancestry with 34,121,887 SNPs**. The European ancestry summary statistics 
were subset to 1,120,498 SNPs that were present on the European linkage disequilibrium (LD) panels 
and the African ancestry summary statistics were subset to 1,225,091 SNPs that were present on the 
African LD reference panels. 


2.1.2 AIDS Clinical Trials Group


Participants were ART-naive individuals who had initiated ART in prospective, randomized clinical
trials of the AIDS Clinical Trials Group (ACTG), and had consented to genetic research and provided
DNA under ACTG protocol A5128®#. Data were generated by conducting a retrospective analysis of
this cohort. Individuals had initiated ART in the United States in studies ACTG384, A5095
(NCT00013520), A5142 (NCT00050895), A5202 (NCT00118898), and A5257 (NCT25285539)4 47.
All participants provided written, informed consent for genetic testing. Drug class components of 
regimens were randomly assigned except for nucleoside reverse transcriptase inhibitor (NRTI) choice 
in A5142. Included individuals had the following data: imputed genotype, sex, genetically inferred 
ancestry (GIA), lymphocyte count or CD4 cell count data. Additional eligibility criteria included HIV- 
1 RNA <400 copies/mL at week 48 on ART. 


2.1.3 Penn Medicine BioBank 


The Penn Medicine BioBank (PMBB) is an electronic health record (EHR)-linked biobank research 
program at the University of Pennsylvania*®. PMBB participants included in this study provided consent 
for research including access to their medical records, blood sample collection, and generation of 
genetic data‘. Individuals with both imputed genotype data from PMBB v2.0 and with lymphocyte 
count data were included in PGS analysis as a positive control. Included individuals had the following 
data: imputed genotype, lymphocyte count, sex, and GIA. 


2.2 Genotyping and Quality Control 
2.2.1 AIDS Clinical Trials Group 


DNA extracted from whole blood was labeled with coded identifiers and genotyped in seven phases. 
Phases 1-3 were genotyped at the Broad Institute (Phases 1 and 2 with HumanHap650Yv3_A, and
Phase 3 with Human1M-Duov3_B). Phases 4-7 were genotyped at the Vanderbilt Technologies for
Advanced Genomics (VANTAGE) facility (Phase 4 using the Human Core Exome chip, Phase 5 with
the HumanOmni2.5Exome-8-v1.1_A1 chip, Phase 6 with the HumanOmni25-8v1-2_A1 chip, and
Phase 7 with the Illumina Infinium Multi-Ethnic Global BeadChip (MEGA*)).

Post-genotype quality control procedures utilizing PLINK v1.9? were conducted by Vanderbilt 
Technologies for Advanced Genomics Analysis and Research Design (VANGARD). Prior to 
imputation, samples with genotyping efficiency < 99% or with discordance between genotype sex and 
reported sex were removed. After completing these quality control procedures, each genotyping phase 
was imputed separately utilizing the TOPMed reference panel, which was parallelized by chromosome 
to increase computational efficiency’. During the imputation process, liftOver was used to transform 
genotype data to genome build 385°. After imputation, PLINK was used to merge the seven imputed 
datasets, and variants with imputation R² scores < 0.3, genotyping call rates < 95%, or minor allele
frequency (MAF) < 0.05 were dropped}. GIA was determined using principal component analysis 
(PCA) with 1000 Genomes as the reference, subsequently assigning each participant to one of six 
superpopulations: African (AFR), Admixed American (AMR), East Asian (EAS), European (EUR),
South Asian (SAS), and Other. 
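
The post-imputation variant filters described above can be expressed as a simple predicate. This is a minimal sketch for illustration only: the actual filtering was done with PLINK, and the `Variant` record and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    variant_id: str       # hypothetical identifier
    imputation_r2: float  # imputation quality score
    call_rate: float      # fraction of samples with a genotype call
    maf: float            # minor allele frequency


def passes_qc(v, min_r2=0.3, min_call_rate=0.95, min_maf=0.05):
    """Return False for variants that would be dropped by the ACTG filters."""
    return (
        v.imputation_r2 >= min_r2
        and v.call_rate >= min_call_rate
        and v.maf >= min_maf
    )


variants = [
    Variant("var_1", 0.95, 0.99, 0.20),  # passes all filters
    Variant("var_2", 0.25, 0.99, 0.20),  # low imputation quality
    Variant("var_3", 0.90, 0.94, 0.20),  # low call rate
    Variant("var_4", 0.90, 0.99, 0.01),  # rare variant
]
kept_ids = [v.variant_id for v in variants if passes_qc(v)]
```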


2.2.2 Penn Medicine BioBank 


DNA was extracted from blood samples. Approximately 80% of samples were genotyped by the 
Regeneron Genomics Center (RGC) using an Illumina Global Screening Array v.2.0 (GSAv2)*%, while 
the remaining 20% were genotyped by the Center for Applied Genomics (CAG) at the Children’s 
Hospital of Philadelphia using the GSAv1 and GSAv2 genotyping arrays*®.

Prior to imputation, sample level quality control was conducted**. Using PLINK v1.9, variants with 
genotyping call rates < 95%, individuals with sample call rates < 90%, and individuals with discordance 
between reported sex and genotype sex were dropped**. Autosomes were imputed utilizing a TOPMed 
version R2 genome build 38 reference panel**®*°. After imputation, variants with imputation R² scores
< 0.3, genotype call rate < 99%, or MAF < 1%, and/or that were multi-allelic, were dropped using PLINK
v1.948. Individuals with sample call rate < 99% or discordant sex information were also dropped**. PCA 
was done to identify GIA using 1000 Genomes as the reference and subsequently separated individuals 
into six superpopulations: African (AFR), Admixed American (AMR), East Asian (EAS), European 
(EUR), South Asian (SAS), Other*®. 


2.3 Polygenic Score Calculation 


The PGSlymph was constructed using PRScsx (version released on July 29, 2021), which integrates 
summary statistics and LD panels across genetically diverse populations to better predict polygenic risk 
in global populations*’. 1000 Genomes phase 3 LD reference panels were used in the calculation*!. 
Summary statistics from the lymphocyte count meta-analysis were used to train the PGSlymph”. The 
PGSlymph was applied to ACTG study participants with CD4 cell count data using the PLINK2 “--score” 
function”. As positive controls, the PGSlymph was also applied to individuals with lymphocyte count 
data in ACTG as well as individuals with lymphocyte count data in PMBB. 
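
Conceptually, applying a polygenic score amounts to a weighted sum of effect-allele dosages. The sketch below shows that idea only; it is not how PLINK2 or PRScsx are implemented, scaling conventions differ between tools, and the variant IDs, dosages, and weights are hypothetical.

```python
def polygenic_score(dosages, weights):
    """Weighted sum of effect-allele dosages over variants present in both inputs."""
    shared = dosages.keys() & weights.keys()
    return sum(dosages[v] * weights[v] for v in shared)


# Hypothetical per-allele weights (e.g., posterior effect sizes) and one
# participant's effect-allele dosages (0, 1, or 2 copies).
weights = {"rs1": 0.5, "rs2": -0.25, "rs3": 0.125}
person = {"rs1": 2, "rs2": 1, "rs4": 0}
score = polygenic_score(person, weights)  # 2 * 0.5 + 1 * (-0.25)
```

Variants missing from either the weight file or the genotype data (rs3 and rs4 here) simply do not contribute.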


2.4 Statistical Analysis 


Model performance was assessed across all ancestries combined, and in European and African 
ancestries separately. Linear regressions were calculated, and performance was assessed with the R² 
value generated from a multivariate linear regression between the phenotype of interest and the 
PGSlymph. Additionally, performance of individual covariates was assessed with effect sizes 
generated from these regressions. We used a p-value threshold of 0.05 to assess significance. 
The PGSlymph was applied to two different cohorts, ACTG and PMBB. In ACTG, the predictability 
of the PGSlymph for three different phenotypes was assessed: 
the square root (SQRT) of CD4 cell count at study entry prior to ART (baseline), change in CD4 cell 
count from study entry to 48 weeks of ART (a measure of treatment response), and inverse normal 
lymphocyte count prior to ART (a control variable). We performed two regressions for each phenotype, 
one without correcting for any covariates, and one correcting for age, sex, principal components (PCs) 
1-15 of genetic similarity, and log10 HIV-1 RNA (a measure of viral load). Additionally, we 
adjusted for SQRT of baseline CD4 cell count in regression models between the PGSlymph and change in 
CD4 cell count on ART. In addition to these regressions, we also evaluated interactions between the 
PGSlymph and age, sex, viral load, and baseline CD4 cell count to identify whether the PGSlymph interacts 
with any covariate. In PMBB, the predictability of the PGSlymph for inverse normal lymphocyte count was 
assessed as a positive control and to understand predictability in a general medical biobank population. 
Similarly, two regressions were performed, one without correcting for covariates, and one correcting 
for age, sex, and PC1-15. These results were visualized using Synthesis View”. 
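
The phenotype transformations and the univariate R² described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the rank offset (rank - 0.5) / n is one common convention for the inverse normal transform and is not specified in the text, ties are not handled, and all input values are hypothetical.

```python
import math
from statistics import NormalDist, mean


def inverse_normal(values):
    """Rank-based inverse normal transform using (rank - 0.5) / n (assumed offset)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order, start=1):
        out[i] = NormalDist().inv_cdf((rank - 0.5) / n)
    return out


def r_squared(x, y):
    """R² of a univariate least-squares fit of y on x (squared Pearson correlation)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


# Hypothetical baseline CD4 counts and standardized scores.
cd4 = [100, 400, 225, 900]
sqrt_cd4 = [math.sqrt(c) for c in cd4]
pgs = [-1.0, 0.0, -0.5, 1.0]
r2 = r_squared(pgs, sqrt_cd4)
```

The covariate-adjusted models in the text require a multiple regression, which this univariate helper does not cover.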


3. Results 
Table 1: ACTG Participant Demographics at Baseline 

                           Lymphocyte Count Data     Baseline CD4 Cell Count Data   On-Treatment CD4 Cell Count Data 
Total, N                   4680                      4959                           3274 
European ancestry, n (%)   1835 (39.2%)              1958 (39.4%)                   1319 (40.3%) 
African ancestry, n (%)    1721 (36.8%)              1826 (36.8%)                   1154 (35.2%) 
Male/Female, n (%)         3824/856 (81.7%/18.3%)    4051/908 (81.7%/18.3%)         2715/559 (82.9%/17.1%) 
Age, mean (range)          37.9 (17.0-77.0)          38.0 (17.0-77.0)               38.2 (17.0-76.0) 


Table 2: PMBB Demographics 

                           Lymphocyte Count Data 
Total, N                   37211 
European ancestry, n (%)   25330 (68.1%) 
African ancestry, n (%)    10217 (27.5%) 
Male/Female, n (%)         18215/18996 (49.0%/51.0%) 
Mean Age (Range)           55.6 (13.9-101.7) 


Figure 2: Summary of Regression Results Between PGSlymph and Phenotype without Controlling for Covariates 
(Age, Sex, PC1-15, log10 HIV-1 RNA (viral load), and SQRT of baseline CD4 cell count) 

Figure 3: Summary of Regression Results Between PGSlymph and Phenotype While Controlling for Covariates (Age, 
Sex, PC1-15, log10 HIV-1 RNA (viral load), and SQRT of baseline CD4 cell count) 


4. Discussion 


A lymphocyte count PGS trained in the general population did not effectively predict baseline CD4 cell 
count or change in CD4 cell count in response to ART, leading to rejection of our hypothesis that poor 
CD4 cell recovery in response to ART is dependent on each individual’s overall genetic predisposition 
to this outcome. When running regressions without correcting for covariates, R² values were low across 
all ancestry groups and most regressions were not statistically significant (Figure 2, Supplementary 
Table 1). In contrast, clinical covariates were predictive of these phenotypes. When correcting for 
covariates, performance of the model improved markedly. Baseline regressions performed modestly 
(R² = 0.278) while on-treatment regressions were not very predictive (R² = 0.073), although all values 
were statistically significant (Figure 3, Supplementary Table 2). However, because the PGSlymph itself 
was not highly predictive, the success of this model was mostly due to the contribution of covariates. 
Additionally, when including covariates in the model, the model including the African PGSlymph better 
predicted change in CD4 cell count on-treatment in individuals of African ancestry than the model 
including the European PGSlymph (R² was greater by 0.003) (Figure 3, Supplementary Table 2). This is 
the only case where we see improved performance by an AFR PGSlymph compared to a EUR PGSlymph. 
Interestingly, when considering effects of individual covariates in this model, the influence of the AFR 
PGSlymph is significant (p = 0.044) in individuals of African ancestry with an effect size of -2.062 
(Supplementary Table 3). In comparison to other covariates, this effect size is minimal, but suggests 
that the AFR PGSlymph is playing a role. Furthermore, this shows that our methods improved PGSlymph 
performance in individuals of African ancestry, which was likely because of a combination of a 
PGSlymph based on African ancestry summary statistics and utilizing PRScsx for calculation. 

In univariate analyses, the lymphocyte count PGS did not effectively predict baseline lymphocyte 
count in ACTG participants. R² values were low and not statistically significant (Figure 2, Supplementary 
Table 5). Performance improved when including covariates in this model, as R² values rose to ~0.10 and 
regressions became statistically significant (Figure 3, Supplementary Table 6). Within the covariate 
models, the influence of the EUR PGSlymph is significant in individuals of European ancestry (p = 0.018) 
with a minimal effect size of 0.025 (Supplementary Table 7). Because this effect size is small, 
the EUR PGSlymph is not adding much to this model. Still, this significant effect is 
reflected in the R² value of the EUR PGSlymph covariate model in individuals of European ancestry 
(0.103), which is slightly higher than the R² value of the AFR PGSlymph covariate model in individuals of 
European ancestry (0.101) (Figure 3, Supplementary Table 6). Additionally, in the multivariate model, 
the influence of the AFR PGSlymph is significant in the multi-ancestry group (p=8.7e-3) with an effect 
size of 8.3e-3 (Supplementary Table 9). Although this evidently did not have a large impact on the 
model, its effect is still present, as the R² value of the AFR PGSlymph covariate model in the 
multi-ancestry group (0.098) is slightly higher than the R² value of the EUR PGSlymph covariate model 
in the multi-ancestry group (0.097) (Figure 3, Supplementary Table 6). Also, it is interesting that the 
R² value did not increase as much as in CD4 cell count regressions, perhaps because viral load was the 
greatest contributing covariate (viral load had the lowest p-value of all variables in all CD4 cell count 
regressions), and total lymphocyte counts are not greatly affected by viral load, in contrast to CD4 cell 
counts? (Supplementary Table 3). 

Although this model did not perform well in PWH, it performed slightly better when applied to 
a general medical biobank population, where the PGSlymph best predicted lymphocyte count. 
Regressions were highly statistically significant, likely due to the large 
sample size (~37,000 individuals). In the univariate model, the African PGSlymph applied to the 
multi-ancestry group and the European PGSlymph applied to the European population had the highest R² values 
(~0.01) (Figure 2, Supplementary Table 11). It is interesting that these regressions had the highest R² 
values, as the same two combinations were the only ACTG lymphocyte count regressions with a significant 
contribution from the PGSlymph in the multivariable model. Seeing these patterns across the general population and PWH 
shows that the AFR PGSlymph performs best in a multi-ancestry group and the EUR PGSlymph performs 
best in individuals of European ancestry. When controlling for covariates, performance of the model 
increased. R² values rose to ~0.06-0.10 and p-values dropped even lower (Figure 3, Supplementary 
Table 12). This mirrors the impact of covariates seen in PWH. The effect size of the EUR PGSlymph was 
~0.01 in all ancestry groups (Supplementary Table 13). It is interesting that without covariates, the EUR 
PGSlymph in individuals of European ancestry was the only regression mirroring this effect size (Figure 
2, Supplementary Table 11). The effect size of the AFR PGSlymph was much lower, ~-5e-3 
(Supplementary Table 14). This effect size was mirrored in the AFR PGSlymph regressions without 
covariates in European and African ancestry, as the R² values were also low (~3e° or 8e%), but 
interestingly the R² value was higher when the AFR PGSlymph was applied to the multi-ancestry group 
(~0.01) (Figure 2, Supplementary Table 11). 

Although these results showed that the PGSlymph itself is not predictive of this treatment response, 
some results show that in combination with covariates, the impact of the PGSlymph can become significant, 
suggesting a possible synergistic effect between the PGSlymph and clinical covariates in the model. In the 
regressions between the AFR PGSlymph and change in CD4 cell count in individuals of African ancestry, 
the impact of the PGSlymph was insignificant, but when including clinical covariates in the regression, 
the impact of the PGSlymph became significant (Supplementary Table 3). However, the AFR PGSlymph 
did not significantly interact with any covariates, which argues against a synergistic effect 
(Supplementary Table 4). Additionally, in the regressions between the AFR PGSlymph and baseline 
lymphocyte count in PWH of all ancestry groups, as well as in the regressions between the EUR 
PGSlymph and baseline lymphocyte count in individuals of European ancestry, the same patterns were 
observed (Supplementary Table 7, Supplementary Table 9). Similarly, the AFR PGSlymph did not 
significantly interact with any covariates, but the EUR PGSlymph significantly interacted with age 
(Supplementary Table 8, Supplementary Table 10). Thus, it is possible that in PWH, there are 
synergistic effects between the EUR PGSlymph and covariates that lead the PGSlymph to become 
significant. These findings highlight the importance of including clinical covariates in PGS analyses, 
not only because the covariates themselves are very predictive of treatment response, but also because they 
seem to interact with the PGSlymph in some way. Another explanation for this observation is that 
covariates with strong effects overshadow the effects of the PGSlymph when not controlled for. Covariates 
such as viral load have such high significance and large effect sizes that the effects of smaller-impact 
variables such as the PGSlymph are not seen unless these covariates are controlled for. Thus, it is important 
to consider clinical covariates when implementing PGS in a clinical setting. 

This study had several limitations. First, the sample size of the African ancestry summary 
statistics that were used to generate the African PGSlymph was small (~13,000 individuals), which is 
due to the lack of availability of lymphocyte count summary statistics for individuals of African 
ancestry. To improve these results, more lymphocyte count GWAS data are needed in future studies, 
as it is possible that the AFR PGSlymph could have performed better with a larger base sample size. 
Additionally, the ACTG sample size was modest (~4600 individuals) and was subset into even smaller 
groups when stratified by ancestry. It is possible that associations with the PGSlymph may have become 
statistically significant with a larger sample size. Subsequent work in this area could investigate whether 
this model is predictive of other drug response traits, specifically other ART treatment responses. 

Polygenic scores have the potential to leverage large, publicly available datasets to find novel 
genetic discoveries in pharmacogenomic cohorts. This study utilized a novel method to predict CD4 
cell recovery in response to ART and illustrated the importance of including clinical covariates in a 
PGS model. As more associations or lack thereof are found, we continue to narrow down the biological 
underpinnings of responses to ART including suboptimal CD4 cell recovery. 


Polygenic risk scores (PRS) have predominantly been derived from genome-wide association 
studies (GWAS) conducted in European ancestry (EUR) individuals. In this study, we present an 
in-depth evaluation of PRS based on multi-ancestry GWAS for five cardiometabolic phenotypes in 
the Penn Medicine BioBank (PMBB) followed by a phenome-wide association study (PheWAS). 
We examine the PRS performance across all individuals and separately in African ancestry (AFR) 
and EUR ancestry groups. For AFR individuals, PRS derived using the multi-ancestry LD panel 
showed a higher effect size for four out of five PRSs (DBP, SBP, T2D, and BMI) than those derived 
from the AFR LD panel. In contrast, for EUR individuals, the multi-ancestry LD panel PRS 
demonstrated a higher effect size for two out of five PRSs (SBP and T2D) compared to the EUR 
LD panel. These findings underscore the potential benefits of utilizing a multi-ancestry LD panel 
for PRS derivation in diverse genetic backgrounds and demonstrate overall robustness in all 
individuals. Our results also revealed significant associations between PRS and various phenotypic 
categories. For instance, CAD PRS was linked with 18 phenotypes in AFR and 82 in EUR, while 
T2D PRS correlated with 84 phenotypes in AFR and 78 in EUR. Notably, associations like 
hyperlipidemia, renal failure, atrial fibrillation, coronary atherosclerosis, obesity, and hypertension 
were observed across different PRSs in both AFR and EUR groups, with varying effect sizes and 
significance levels. However, in AFR individuals, the strength and number of PRS associations 
with other phenotypes were generally reduced compared to EUR individuals. Our study 
underscores the need for future research to prioritize 1) conducting GWAS in diverse ancestry 
groups and 2) creating a cosmopolitan PRS methodology that is universally applicable across all 
genetic backgrounds. Such advances will foster a more equitable and personalized approach to 
precision medicine. 


Keywords: Polygenic risk scores, multi-ancestry GWAS, cardiometabolic phenotypes, precision 
medicine 


1. Introduction 


The era of precision medicine has been marked by significant efforts to identify the genetic and 
environmental factors that influence the risk of disease as well as the disease prognosis and 
treatment. Advance knowledge of these factors can provide a major health benefit to individuals, 
as preventative strategies and tailored therapies can be targeted toward individuals at higher risk. 
Results from genome-wide association studies (GWAS) have highlighted the polygenic nature of 


© 2023 The Authors. Open Access chapter published by World Scientific Publishing Company and distributed 
under the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC) 4.0 License. 
most common, complex diseases in that they have identified a large number of loci with small 
genetic effects'*. The polygenic risk score (PRS) has thus emerged as a promising factor for 
predicting disease risk. PRS is the cumulative, mathematical aggregation of risk derived from the 
contributions of many DNA variants across the genome’. 

Recent studies have shown the high prevalence of cardiometabolic conditions among adults in 
the United States*, and together they are the leading cause of mortality around the world>®. GWAS 
have identified hundreds of loci associated with common diseases such as coronary artery disease 
(CAD), obesity’, hypertension’ (measured using systolic blood pressure [SBP] and diastolic blood 
pressure [DBP]), and type 2 diabetes (T2D)!°. Among the individuals that are diagnosed with one 
disease (for example, T2D), the prevalence of comorbidities such as hypertension, CAD, heart 
failure, and chronic kidney disease is also increased. To fully evaluate disease risk in an individual, 
it is therefore essential to also consider comorbid or secondary conditions related to the primary 
disease. There are several GWAS that have identified shared genetic associations between 
cardiometabolic conditions, demonstrating similarity in the underlying genetic architecture! !!?, 
Pathophysiology of these conditions also shows the cross-talk between organ systems and its effect 
on disease progression, such as hemodynamic interaction between heart and kidney in heart 
failure'?. With PRS, it is possible to derive an individual's disease risk for each cardiometabolic 
condition using GWAS summary statistics. PRS represents an aggregate measure of the 
cumulative effect of numerous genetic variants on a particular disease, capturing an individual's 
genetic predisposition. As such, PRS can be instrumental in assessing the genetic interplay among 
coexisting or comorbid conditions. 

Numerous methodologies exist for constructing PRS targeted at specific diseases. 
Conventionally, genetic risk scores (GRS) were derived using the genome-wide significant SNPs 
from a GWAS; however, recent studies show that using association results at a much less stringent 
p-value threshold (p<0.05) segregates individuals' risk with better accuracy'. The development and 
clinical utility of PRS is under active investigation, especially in globally diverse populations'*'°. 
Most large-scale GWAS have been conducted in individuals from European ancestry populations 
and most PRS are derived from these studies. Subsequently, the majority of PRS investigations 
published to date have been conducted in populations of European ancestry'’. There can be several 
differences such as linkage disequilibrium (LD) structure and allele frequency of the variants, 
which can lead to inaccurate PRS for non-European populations'’. This is not unique to PRS 
studies, and the majority of human genetic research suffers from this same phenomenon!’. To 
ensure the successful clinical implementation of PRS, it is imperative to evaluate its performance 
in diverse global populations that closely reflect the healthcare population being treated. Moreover, 
for PRS to become a truly inclusive and effective tool for precision medicine, they must be 
applicable to individuals of all genetic backgrounds, including those with mixed ancestral 
backgrounds. Achieving this level of equity and broad usability will contribute significantly to the 
advancement of personalized healthcare practices. 

In this study, we investigated the implementation of PRS for cardiometabolic conditions in 
individuals in the Penn Medicine BioBank (PMBB). PMBB is a cohort of >250,000 individuals 
established for genomic and precision medicine research. Approximately 45,000 of the individuals 
have genetic data imputed using the Trans-Omics for Precision Medicine (TOPMed) v2 dataset!’. 
20% of the PMBB study population is classified as African (AFR) ancestry based on genetic 
similarity to the 1000 Genomes Project (1KGP)”? AFR superpopulation group. We calculated PRS in the 
PMBB based on GWAS summary statistics generated in multi-ancestry data to evaluate 1) risk 
prediction accuracy among all individuals, and among AFR and European (EUR) subpopulations; 
and 2) the utility of PRS in determining genetic overlap among cardiometabolic conditions. 


2. Methods 
2.1. Penn Medicine BioBank 


The Penn Medicine BioBank (PMBB) recruits participants through the University of Pennsylvania 
Health System by enrolling at the time of appointment”!. Patients participate by donating either 
blood or a tissue sample and allowing researchers access to their electronic health record (EHR) 
information. This academic biobank provides researchers with centralized access to a large number 
of blood and tissue samples with extensive health information from the EHR. The facility banks 
both blood specimens (i.e., whole blood, plasma, serum, buffy coat, and DNA isolated from 
leukocytes) and tissues (i.e., formalin-fixed paraffin-embedded, fresh, and flash frozen). 


2.2. Genotyping, Quality Control, and Imputation 


The DNA extracted from blood samples was genotyped using the Illumina Global Screening 
Array. To ensure data integrity, we conducted quality control measures, excluding SNPs with a 
marker call rate of less than 95% and samples with a call rate of less than 90%. Additionally, 
individuals with sex discrepancies were removed from the analysis. Imputation was carried out 
using the Michigan Imputation server, leveraging the TOPMed Reference panel!’. To determine 
genetic ancestry, we employed principal component analysis (PCA) using the smartpca tool?” and 
the 1KGP dataset”°. Genetic ancestry was inferred through a k-means clustering approach, utilizing 
the 1KGP super populations as genetic ancestry labels. 


2.3. Polygenic Risk Scores 


To derive PRS, we used the multi-ancestry summary statistics from the largest and/or most recent 
GWAS studies for each trait (See Table 1). 


Table 1. Multi-ancestry GWAS 


Phenotype Sample size (N cases) PMID 

BMI 241,258 28443625° 
CAD 547,261 (122,733) 29212778" 
Hypertension (DBP, SBP) 318,891 30578418" 
T2D 1,407,282 (228,499) 32541925" 

Weights for each SNP were calculated using PRS-CS” (version from April 24, 2020), a method 
that performs polygenic prediction via Bayesian regression and continuous shrinkage priors. PRS-CS 
requires a reference panel that matches the ancestry distribution of the target data set. We 
generated multiple reference panels for analyses: a multi-ancestry LD reference panel using the 
HapMap SNPs from the entire 1KGP population (2504 individuals), an African-only reference 
panel from the 1KGP African ancestry population, and a European-only reference panel from the 
1KGP European ancestry population. We identified LD patterns within the 1KGP population by 
using PLINK (version 1.90) to determine LD blocks and calculate the LD between the SNPs in 
each block. For PRS-CS, the global shrinkage parameter φ was fixed to 0.01, and default values 
were selected for all other parameters. PRSs were then calculated using the weights with PLINK. 
Only the SNPs present in all three of the target data set, summary statistics, and LD reference 
panel were included in the PRSs. 
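
The SNP restriction in the last sentence is a three-way set intersection. A minimal sketch follows, with hypothetical variant IDs; real pipelines would also need to reconcile alleles and strand, which is not shown.

```python
def shared_snps(target, summary_stats, ld_panel):
    """SNP IDs present in the target data, the summary statistics, and the LD panel."""
    return set(target) & set(summary_stats) & set(ld_panel)


# Hypothetical SNP ID lists for the three inputs.
target_ids = ["rs1", "rs2", "rs3", "rs5"]
sumstat_ids = ["rs2", "rs3", "rs4", "rs5"]
ld_ids = ["rs1", "rs3", "rs5", "rs6"]
usable = sorted(shared_snps(target_ids, sumstat_ids, ld_ids))
```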


2.4. Phenotypes 


We focused on four primary phenotypes to derive and evaluate the PRS association: CAD, 
hypertension (for DBP and SBP PRS), T2D, and BMI. Cases and controls for each binary 
phenotype were defined using International Classification of Diseases (ICD-9 and ICD-10) 
diagnosis codes (CAD: 414.0*, I25.1*; T2D: 250*, E11*; hypertension: 401*, I10*). Participants 
were coded as cases of a given phenotype if their records contained at least 1 of the corresponding 
ICD-9 or ICD-10 codes. The median value for BMI was extracted from the EHR. 
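
The wildcard case definition above can be sketched as prefix matching on a participant's diagnosis codes. The sketch assumes the ICD-10 patterns are I25.1* and I10* (with a leading letter I), treats each trailing * as "any suffix", and uses hypothetical names throughout; it is not the authors' phenotyping code.

```python
# Code prefixes per phenotype (ICD-9 first, ICD-10 second); assumed from the text.
CASE_PREFIXES = {
    "CAD": ("414.0", "I25.1"),
    "T2D": ("250", "E11"),
    "hypertension": ("401", "I10"),
}


def is_case(patient_codes, phenotype):
    """True if at least one diagnosis code starts with a prefix for the phenotype."""
    prefixes = CASE_PREFIXES[phenotype]
    return any(code.startswith(prefixes) for code in patient_codes)


# Hypothetical participant record.
record = ["E11.9", "Z00.00", "401.1"]
flags = {name: is_case(record, name) for name in CASE_PREFIXES}
```

A single matching code is enough for case status here, mirroring the "at least 1" rule in the text.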


For Phenome-wide Association Study (PheWAS) analysis, we derived phenotypes using ICD-9 
and ICD-10 data from individuals from the Penn Medicine EHR. ICD-9 codes were aggregated to 
phecodes using the phecode ICD-9 map 1.2747°; ICD-10 codes were aggregated to phecodes using 
the phecode ICD-10 map 1.2 (beta)?°. Individuals are considered cases for the phenotype if they 
have at least 2 instances of the phecode on unique dates, controls if they have no instance of the 
phecode, and ‘other/missing’ if they have one instance of the phecode or a related phecode. 
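
The phecode case/control rule can be written as a small function of the dates on which a phecode appears. This sketch covers only the counting rule; the exclusion for related phecodes mentioned in the text is not modeled, and the function name is hypothetical.

```python
from datetime import date


def phecode_status(occurrence_dates):
    """Classify one participant for one phecode from its occurrence dates."""
    unique_days = len(set(occurrence_dates))
    if unique_days >= 2:
        return "case"
    if unique_days == 1:
        return "other/missing"
    return "control"


# Two records on the same day count as a single instance.
same_day = phecode_status([date(2020, 1, 5), date(2020, 1, 5)])
two_days = phecode_status([date(2020, 1, 5), date(2021, 3, 9)])
never = phecode_status([])
```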


2.5. Statistical Analysis 


PRS were normalized (mean of 0 and standard deviation of 1) for each analysis separately 
(stratified by ancestry and overall). Logistic or linear regression models accounting for age, sex, 
and the first 5 within-ancestry principal components (PCs) were used to test for association of PRS 
with each of the primary phenotypes (T2D, BMI, hypertension, and CAD). The area under the receiver 
operating characteristic curve (AUC) and the DeLong test were computed using the R package pROC, 
using the full logistic regression model as above. AUC was also calculated for a reduced logistic 
regression model including covariates alone (age, sex, and the first 5 PCs). The DeLong test?’ is a 
non-parametric approach used to compare the AUCs of two correlated ROC curves, especially when 
the models are applied to the same set of samples. This test was used to compare the null model and 
the full model that includes PRS and obtain a p-value indicating the statistical significance of the 
difference between the two AUCs. BMI was treated as a continuous trait, and the R² value is 
reported for all analyses. 
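
Two of the quantities above are easy to illustrate: standardizing a score to mean 0 and standard deviation 1, and computing an AUC as the probability that a randomly chosen case outranks a randomly chosen control. This is a sketch, not the pROC implementation; the use of the population standard deviation is an assumption, the inputs are hypothetical, and in the paper the AUC is computed on predictions of the full regression model rather than on the raw score.

```python
from statistics import mean, pstdev


def standardize(scores):
    """Rescale to mean 0 and standard deviation 1 (population SD assumed)."""
    m, s = mean(scores), pstdev(scores)
    return [(x - m) / s for x in scores]


def auc(scores, labels):
    """AUC as the fraction of case/control pairs ranked correctly (ties count 0.5)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


# Hypothetical scores and case (1) / control (0) labels.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
raw_auc = auc(scores, labels)
z_auc = auc(standardize(scores), labels)
```

Because the AUC depends only on ranks, standardizing the score leaves it unchanged.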


A PheWAS was performed using logistic regression models with each PRS as the independent 
variable, phecodes as the dependent variables, and age, sex, and the first 10 PCs as covariates. A 
phenome-wide Bonferroni significance threshold of 4.2 x 10⁻⁵ (0.05/1190) in AFR and 3.6 x 10⁻⁵ 
(0.05/1377) in EUR was applied to account for multiple testing. 
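
The Bonferroni thresholds follow directly from dividing the significance level by the number of phecodes tested in each group, as this short check shows (the function name is illustrative).

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


afr_threshold = bonferroni_threshold(0.05, 1190)  # phecodes tested in AFR
eur_threshold = bonferroni_threshold(0.05, 1377)  # phecodes tested in EUR
summary = f"AFR {afr_threshold:.1e}, EUR {eur_threshold:.1e}"
```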


3. Results 
3.1. Penn Medicine BioBank (PMBB) Demographics 


PMBB currently consists of >250,000 consented individuals. Approximately 45,000 of these 
participants have been genotyped to date. Demographics of the sample included in this study are 
shown in Table 2. 


Table 2. Demographics of PMBB sample 


All AFR EUR 
Total patients 43,530 11,189 30,094 
% Female 50.1% 62.8% 44.9% 
Mean age 55.2 51.7 57.3 
% CAD 23.8% 18.8% 26.4% 
% Hypertension 54.4% 65.2% 51.7% 
% T2D 23.5% 35.1% 19.3% 
Patients with BMI data 40,043 10,619 27,489 
% Female 50.4% 63.4% 44.9% 
Mean age 55.6 51.9 Shel 


3.2. Determining the effect of linkage disequilibrium panel on PRS in the overall sample 


Using publicly available multi-ancestry GWAS data (Table 1), we generated a PRS for each 
primary phenotype of interest: type 2 diabetes, body mass index, hypertension (SBP and DBP), 
and coronary artery disease. We assessed the impact of using a multi-ancestry LD panel, akin to 
the GWAS data, and compared it with an AFR LD panel (in all PMBB individuals and in AFR 
PMBB individuals) and an EUR LD panel (in all PMBB individuals and in EUR PMBB 
individuals). AUC values were computed for each binary phenotype PRS in all individuals (Table 
3) and contrasted between the full model (AUC, covariates + PRS) and the model containing 
covariates alone (AUC Null). The addition of PRS consistently improved the covariate model for 
all phenotypes, showing an average AUC improvement of 0.014. Across the entire dataset, the 
PRS created with the multi-ancestry LD panel (DBP, BMI) or the EUR LD panel (CAD, SBP, 
T2D) demonstrated the strongest association with their respective primary phenotypes (Table 3). 


Table 3. Comparison of LD panel for PRS in all individuals 

PRS   LD Panel         AUC¹ Null   AUC¹    DeLong P   Model OR   Model P-value 
CAD   Multi-ancestry   0.795       0.808   1.22E-53   1.495      5.82E-186 
CAD   AFR              0.795       0.807   1.22E-52   1.472      7.11E-182 
CAD   EUR              0.795       0.807   2.33E-52   1.515      1.00E-184 
DBP   Multi-ancestry   0.770       0.773   8.90E-06   1.236      1.65E-49 
DBP   AFR              0.770       0.772   1.32E-15   1.219      1.59E-49 
DBP   EUR              0.770       0.772   6.15E-14   1.226      6.32E-43 
SBP   Multi-ancestry   0.770       0.775   4.47E-23   1.365      2.48E-83 
SBP   AFR              0.770       0.775   3.74E-22   1.338      2.78E-80 
SBP   EUR              0.770       0.775   7.40E-23   1.376      2.31E-83 
T2D   Multi-ancestry   0.695       0.730   5.41E-88   2.223      1.24E-286 
T2D   AFR              0.695       0.727   2.68E-79   2.095      3.18E-266 
T2D   EUR              0.695       0.731   2.44E-91   2.263      1.46E-297 

PRS   LD Panel         R² Null     R²      R² difference   Model Beta   Model P-value 
BMI   Multi-ancestry   0.067       0.110   0.043           2.205        0 
BMI   AFR              0.067       0.110   0.043           2.125        0 
BMI   EUR              0.067       0.108   0.042           2.198        0 


3.3. Determining the effect of linkage disequilibrium panel on PRS within ancestry 


In both AFR (Table 4) and EUR (Table 5) individuals, the addition of PRS to the covariate model 
enhances model performance. However, it is noteworthy that PRS performance was relatively 
stronger in EUR individuals compared to AFR individuals. In AFR, the full model shows a 
somewhat smaller improvement over the covariate-based model (average improvement in 
AUC=0.011) compared to the improvement observed in EUR (average improvement in 
AUC=0.021). 


Notably, in AFR individuals, the PRS calculated using the multi-ancestry LD panel exhibited a 
higher effect size in four out of the five PRSs (DBP, SBP, T2D, and BMI) compared to the AFR 
LD panel (Table 4). This indicates the potential benefits of using a multi-ancestry LD panel to 
derive PRS in populations with diverse genetic backgrounds. 


1 AUC rounded to three decimal points 


Table 4. Comparison of LD panel for PRS in AFR individuals 

PRS   LD Panel         AUC Null   AUC     DeLong P   Model OR   Model P-value 
CAD   AFR              0.764      0.770   1.33E-06   1.261      2.75E-18 
CAD   Multi-ancestry   0.764      0.770   4.52E-06   1.253      2.45E-17 
DBP   AFR              0.793      0.797   1.72E-05   1.208      4.56E-15 
DBP   Multi-ancestry   0.793      0.797   1.25E-05   1.214      2.56E-15 
SBP   AFR              0.793      0.797   3.82E-06   1.252      3.00E-18 
SBP   Multi-ancestry   0.793      0.797   1.11E-06   1.277      9.65E-20 
T2D   AFR              0.681      0.710   3.03E-25   1.630      5.73E-77 
T2D   Multi-ancestry   0.681      0.711   4.21E-26   1.689      1.73E-79 

PRS   LD Panel         R² Null    R²      R² difference   Model Beta   Model P-value 
BMI   AFR              0.041      0.065   0.024           1.449        1.02E-59 
BMI   Multi-ancestry   0.041      0.063   0.022           1.462        6.84E-56 


In EUR individuals, the PRS calculated using the multi-ancestry LD panel demonstrated a higher 
effect size in two out of the five PRSs (SBP and T2D) when compared to the EUR LD panel (Table 
5). This observation highlights the potential advantages of leveraging a multi-ancestry LD panel 
in deriving PRS for certain phenotypes in populations with European ancestry. 


Table 5. Comparison of LD panel for PRS in EUR individuals 

PRS   LD Panel         AUC Null   AUC           DeLong P   Model OR   Model P-value 
CAD   EUR              0.796      0.812         9.49E-48   1.533      5.65E-166 
CAD   Multi-ancestry   0.796      0.812         2.38E-48   1.531      5.73E-165 
DBP   EUR              0.747      0.750         6.17E-11   1.173      9.17E-34 
DBP   Multi-ancestry   0.747      0.750         1.51E-12   1.158      9.43E-29 
SBP   EUR              0.747      [illegible]   ?.64E-21   1.251      1.49E-64 
SBP   Multi-ancestry   0.747      0.753         1.61E-20   1.255      2.40E-66 
T2D   EUR              0.651      0.708         8.26E-87   1.721      5.68E-243 
T2D   Multi-ancestry   0.651      0.710         1.12E-82   1.757      8.59E-258 

PRS   LD Panel         R² Null    R²      R² difference   Model Beta   Model P-value 
BMI   EUR              0.006      0.076   0.070           1.637        0 
BMI   Multi-ancestry   0.006      0.075   0.069           1.626        0 


3.4. PheWAS of polygenic risk scores 


We conducted a PheWAS of each multi-ancestry LD panel PRS in AFR and EUR individuals, 
identifying additional phenotypes associated with the PRS for our primary phenotypes (Figure 1, 
full results in Supplemental Tables Online: https://shorturl.at/uBDSX). The results reveal 
significant associations between the PRS and various phenotypic categories, shedding light on the 
potential implications of PRS in predicting disease susceptibility. All PRS exhibited associations 
with other phenotypes. However, in AFR individuals, the strength and number of PRS associations 
with other phenotypes were generally reduced compared to EUR individuals. 


In our analysis, the CAD PRS in AFR individuals was associated with 18 distinct phenotypes, 
including notable associations with hyperlipidemia (OR=1.12, p=1.1x10°°) and renal failure 
(OR=1.12, p=1.0x10°). In contrast, EUR individuals exhibited associations with a broader range 
of 82 phenotypes, with hyperlipidemia (OR=1.23, p=7.3x10-45) and renal failure (OR=1.10, 
p=2.1x10°°) being among them. 


For the DBP and SBP PRS, AFR individuals showed associations with 9 and 20 phenotypes 
respectively. Specific associations of interest included atrial fibrillation for DBP (OR=1.20, 
p=1.4x10°) and both coronary atherosclerosis (OR=1.20, p=3.7x107) and T2D (OR=1.12, 
p=3.2x10°>) for SBP. EUR individuals, on the other hand, had DBP and SBP PRS associated with 
12 and 27 phenotypes, respectively. This encompassed associations like coronary atherosclerosis 
for both DBP (OR=1.09, p=4.9x10°’) and SBP (OR=1.13, p=1.6x10°!), and T2D specifically for 
SBP (OR=1.17, p=1.0x10"!’). 


The T2D PRS in AFR individuals was linked with a vast array of 84 phenotypes. Key associations 
here were hyperlipidemia (OR=1.30, p=6.0x10°'°), obesity (OR=1.20, p=6.6x107!°), and 
hypertension (OR=1.22, p=4.5x10°°). EUR individuals had a slightly lesser range with 78 
phenotypes, but with significant associations like hyperlipidemia (OR=1.31, p=9.2x10°!), obesity 
(OR=1.29, p=9.9x10°”), and hypertension (OR=1.22, p=3.2x10°°8). Lastly, the BMI PRS in AFR 
was associated with 19 phenotypes, including T2D (OR=1.17, p=1.6x10) and hypertension 
(OR=1.18, p=8.6x10°%). In EUR individuals, this PRS was linked with a more extensive 72 
phenotypes, with notable associations being T2D (OR=1.26, p=4.6x10°") and hypertension 
(OR=1.19, p=2.2x10-32). 
Figure 1. Phenome-wide Association Study (PheWAS) Results for Polygenic Risk Scores (PRS) for coronary artery 
disease (CAD), Diastolic Blood Pressure (DBP), Systolic Blood Pressure (SBP), Type 2 Diabetes (T2D), and Body 
Mass Index (BMI). The x-axis represents the phecode categories, and the y-axis shows the -log10 p-values, color- 
coded by category. 
4. Discussion 


We generated five polygenic risk scores representing genetic liability for cardiometabolic diseases 
and assessed their performance across different ancestry groups in the Penn Medicine BioBank 
(PMBB), a biobank including DNA linked with electronic health records. For all PRS tested, we 
identified a statistically significant association with the primary phenotype in both ancestry groups, 
as validated by the DeLong test comparing the null and the full model. 


Type 2 diabetes consistently exhibited the highest effect size, reflecting the large number of cases 
in the GWAS used to generate this PRS and in the PMBB dataset. In contrast, the hypertension PRSs 
(DBP and SBP) showed a weaker effect size, even with a larger GWAS and over 50% of PMBB 
patient participants with hypertension. These observations suggest that factors beyond sample size, 
such as disease heterogeneity, prevalence, and non-additive effects, influence PRS associations. 
Understanding the interplay of these factors will therefore be pivotal in refining and 
optimizing the application of PRS in disease prediction and risk assessment. 


Our PheWAS analyses were conducted to explore the broader phenotypic landscape associated 
with each PRS with an EHR-linked biobank. Many of the identified phenotypes could be linked 
to broader effects of known disease risk factors and established comorbidities. For instance, risk 
for Type 2 diabetes was associated with hypertension, a known commonly co-occurring trait”. 
Similarly, the BMI PRS was associated with sleep apnea, diabetes, and hypertension, all of which 
are known to be more prevalent in individuals with higher BMI°? **. However, these associations 
don’t necessarily imply causality. The high prevalence of comorbidities among these phenotypes 
complicates the task of discerning whether the genetic risk for one condition directly influences 
the onset of another. 


Our findings underscore a significant challenge in the future implementation of PRS into routine 
clinical care. While PRS derived from multi-ancestry GWAS can be associated with phenotypes 
in individuals of African ancestry (AFR), the associations are weaker than those observed in 
individuals of European ancestry (EUR). This observation, although expected, has been a topic of extensive 
discussion in recent years, emphasizing a notable disparity in genetic research'*'’. Our results here 
affirm that these expectations persist even in large-scale, diverse ancestry datasets. Furthermore, 
our study suggests that PRS for cardiometabolic diseases based on multi-ancestry GWAS data 
might not perform as robustly for the primary disease and its associated secondary cardiometabolic 
traits. 


Our utilization of a multi-ancestry LD panel to compute PRS for all individuals from multi- 
ancestry GWAS demonstrated robust performance across all populations. This was especially true 
for African ancestry individuals, emphasizing the potential advantages of leveraging a multi- 
ancestry reference panel in PRS generation. As the field of precision medicine continues to evolve, 
advocating for the adoption of such panels becomes increasingly important. By addressing these 
challenges, we can pave the way for more inclusive and accurate personalized healthcare 
strategies. 
One notable limitation of our study is the modest gain in predictive performance over the null 
model across all categories, as reflected in the AUC values. While we observed differences in 
AUC between the ancestry groups, the absolute increase in AUC over the null model was relatively 
small. This underscores the need for further refinement in PRS methodologies to achieve more 
substantial improvements in predictive performance. Additionally, in our PheWAS approach, 
there are inherent challenges when comparing results between AFR and EUR groups. The 
difference in sample sizes between these groups can lead to variations in statistical power, 
potentially influencing the observed associations. Moreover, the generally lower PRS performance 
in the AFR group, as highlighted in our results, can further compound these challenges. It is 
essential to interpret the PheWAS results with these considerations in mind. 
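
As a small illustration of what an "absolute increase in AUC over the null model" means, the following Python sketch computes the AUC of a null model and a full model on toy data and reports the difference. The scores are made-up numbers, not PMBB data, and the function name `auc` is introduced here for illustration only. The sketch computes AUCs only; it does not implement the DeLong test used in the study.

```python
import numpy as np

def auc(y_true, scores):
    """AUC as the probability that a random case outranks a random control.

    Ties between a case and a control count as one half.
    """
    y = np.asarray(y_true).astype(bool)
    s = np.asarray(scores, dtype=float)
    diff = s[y][:, None] - s[~y][None, :]        # every case-control pair
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

# Toy illustration with hypothetical predicted risks (not study data).
y = np.array([0, 0, 1, 1])
null_scores = np.array([0.2, 0.6, 0.4, 0.5])     # hypothetical covariate-only model
full_scores = np.array([0.1, 0.4, 0.35, 0.8])    # hypothetical model with the PRS added
gain = auc(y, full_scores) - auc(y, null_scores)  # absolute AUC increase over the null
```

A small `gain` alongside a highly significant association is the pattern described in the limitation above: statistical significance of the PRS term does not by itself imply a large improvement in discrimination.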


In conclusion, while there’s considerable enthusiasm surrounding PRS in clinical care, there 
remains a significant amount of research to be conducted to determine its optimal 
implementation. It is essential to explore how PRS can be incorporated alongside other commonly 
used predictors*3, such as family history, clinical comorbidities, and environmental/lifestyle 
factors. By combining PRS with established clinical guidelines, we can aim for a more 
comprehensive risk assessment, leading to personalized interventions. Another important issue to 
address is whether we will ultimately need ancestry-specific PRS models or if we can develop the 
statistical framework to integrate global and local LD patterns into the PRS model to produce a 
cosmopolitan PRS approach. For clinical implementation, a cosmopolitan PRS approach will be 
easier for clinicians to adopt; however, it is unclear how this can be done effectively, given the 
heterogeneity in LD patterns, effect sizes, and causal variants in different ancestry groups. Our 
work here suggests that the use of multi-ancestry GWAS and LD panels may be a step towards 
this goal. The ultimate success of PRS in precision medicine lies in integrating it seamlessly with 
published clinical guidelines and incorporating an individual's ancestry within the PRS framework. 
This integration will empower clinicians to make informed decisions based on a comprehensive 
and personalized risk profile for each patient. By addressing these key aspects and enhancing our 
understanding of the role of PRS in precision medicine, we can unlock its full potential as a 
transformative tool in healthcare, facilitating early interventions and preventive measures that cater 
to each individual's unique genetic makeup and health needs. 
High throughput profiling of multiomics data provides a valuable resource to better understand 
complex human diseases such as cancer and to potentially uncover new subtypes. 
Integrative clustering has emerged as a powerful unsupervised learning framework 
for subtype discovery. In this paper, we propose an efficient weighted integrative clustering 
method called intCC that combines ensemble methods, consensus clustering and kernel learning 
integrative clustering. We illustrate via extensive simulation studies and a case study on the 
TCGA pan cancer datasets that intCC can accurately uncover the latent cluster structures. 
An R package intCC implementing our proposed method is available at 
https://github.com/candsj/intCC. 


Keywords: Integrative clustering; Consensus clustering; Multiomics data; Ensemble learn- 
ing. 


1. Introduction 


Recent advancements in high throughput technologies have enabled rapid profiling of different 
omics data, including genomics, epigenomics, transcriptomics, proteomics and metabolomics 
which allow for in-depth study of the complex regulatory patterns from a systems biology per- 
spective. For example, the Cancer Genome Atlas (TCGA) has generated over 2.5 petabytes of 
multiomics data. Such datasets offer the opportunity to explore the heterogeneity underpin- 
ning diseases such as cancer via unsupervised learning based on clustering framework, which 
could help define cancer subtypes, bringing us a step closer towards personalized medicine. 

For multimodal data, e.g., the different omics data, a key challenge in data analysis 
is identifying the most appropriate approach for data integration. For unsupervised clustering 
over multimodal data, this includes the choice between a single-step and a two-step approach. 
A single-step approach, also known as joint modeling, combines all datasets together. 
A two-step approach works by clustering each dataset separately, followed by integration of 
these clusters. 

A number of integrative clustering methods and tools have been proposed to date. These 
include Bayesian Consensus Clustering (BCC¹), iCluster,² iClusterPlus,³ Cluster Of Clusters 
Analysis (COCA⁴), Clusternomics⁵ and kernel learning integrative clustering (KLIC⁶). 


© 2023 The Authors. Open Access chapter published by World Scientific Publishing Company and 
distributed under the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC) 
4.0 License. 
BCC, Clusternomics and iClusterPlus are based on a Bayesian modeling framework and rely on 
Markov Chain Monte Carlo (MCMC) algorithms for fitting the model. These methods also assume 
that the probability model for each dataset is specified. However, the software for BCC and 
Clusternomics currently implements the algorithms only for Gaussian distributed datasets, thus 
limiting the applicability of these methods to non-Gaussian datasets such as SNP, mutation 
or copy number datasets. 

On the other hand, iCluster works by assuming a Gaussian latent variable model for inferring 
the cluster structures, whereas iClusterPlus increases the versatility of iCluster by incorporating 
statistical models for continuous, binary, multinomial and count datasets via a Bayesian 
latent variable model and employs an MCMC algorithm for sampling from its posterior distribution 
for statistical inference. However, the software implementation of iClusterPlus is currently 
limited to integrative clustering of at most four datasets. Since the model involves tuning a 
number of parameters, computational time becomes the bottleneck when the number of datasets 
or features increases. 

Another popular integrative clustering approach is COCA,⁴ which was first introduced to 
define cancer subtypes by clustering six different datasets, namely DNA copy number, DNA 
methylation, mRNA expression, microRNA expression, protein expression, and somatic point 
mutation. COCA works by first clustering each dataset using consensus clustering,⁷ followed 
by clustering the binary matrix generated by aggregating the clusters obtained from each 
dataset. While this approach is robust and easily scalable to a large number of datasets, a 
limitation of COCA is that all datasets contribute equally to the final clustering, which can reduce 
the accuracy of the clusters obtained, especially when some datasets are less 
reliable. 

Taking inspiration from COCA and multiple kernel learning,⁸,⁹ KLIC⁶ was developed to 
address this pitfall of COCA. Similar to COCA, KLIC works by first applying consensus 
clustering to each dataset. The authors proved that these consensus matrices are positive 
semi-definite kernels, which can then be used as input to multiple kernel k-means clustering, 
with a weight estimated for each kernel via a two-step optimization strategy 
and convex quadratic programming. This approach allows more informative datasets to 
contribute more to the overall clustering. Currently, KLIC runs one clustering algorithm on 
each dataset to generate the consensus matrix. 

In this paper, we seek to extend the KLIC framework to a more robust integrative clustering 
method by proposing a two layer weighted integrative clustering, which allows more than one 
clustering algorithm to be run on each dataset, i.e., ensemble clustering, with the results aggregated 
via an efficient weight estimation. 


2. Methods 


Our proposed method can be viewed as a combination of (a) ensemble clustering, i.e., aggregating 
multiple clustering algorithms, (b) consensus clustering, i.e., resampling, and (c) kernel 
learning integrative clustering. While some papers use ensemble and consensus clustering 
interchangeably, in this paper we refer to ensemble clustering as a collection of multiple clustering 
algorithms, e.g., k-means, hierarchical clustering or partitioning around medoids (PAM), 
and to consensus clustering as a framework which repeatedly draws random subsamples from either the 
sample or feature space. We now briefly describe the consensus clustering and kernel learning 
integrative clustering frameworks. 

Consensus clustering was originally proposed by Monti et al. (2003).⁷ The main idea behind 
consensus clustering is to apply a resampling scheme on the sample or feature dimension under 
the assumption that different subsamples drawn from the dataset should not differ much in 
the clustering results. The resampling scheme allows one to assess the stability of the cluster 
assignments and the robustness of the dataset to perturbations, and thus can aid in deriving a 
more stable and reliable result that reveals the real structure underlying the dataset. 

A key element derived from consensus clustering is the consensus matrix, which measures 
the agreement among samples. For a dataset with N samples, the consensus matrix M 
is an N x N matrix whose element M(i, j) denotes the proportion of resampling iterations in which 
sample i and sample j are assigned to the same cluster. Values close to 1 indicate that the two 
samples are almost always assigned to the same cluster, whereas values close to 0 indicate that they 
are almost always assigned to different clusters. 1 − M is a distance measure which can be used to 
derive a final clustering result. 
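
The construction of a consensus matrix can be sketched in a few lines of Python. This is a minimal illustration, not the intCC implementation: the function names, the plain Lloyd's k-means used as the base algorithm, and the choice to divide by the number of times two samples were drawn together are all assumptions made for this example.

```python
import numpy as np

def kmeans_labels(X, k, rng, n_iter=20):
    """Plain Lloyd's k-means; returns one cluster label per row of X."""
    X = np.asarray(X, dtype=float)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):      # keep the old centre if a cluster empties
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def consensus_matrix(X, k, n_resamples=100, frac=0.8, seed=0):
    """Proportion of resamples in which two samples share a cluster.

    Each resample draws a fraction `frac` of samples and of features
    without replacement (an assumed resampling scheme).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    together = np.zeros((n, n))   # times i and j fell in the same cluster
    sampled = np.zeros((n, n))    # times i and j were drawn together
    for _ in range(n_resamples):
        rows = rng.choice(n, size=int(frac * n), replace=False)
        cols = rng.choice(p, size=max(1, int(frac * p)), replace=False)
        labels = kmeans_labels(X[np.ix_(rows, cols)], k, rng)
        same = (labels[:, None] == labels[None, :]).astype(float)
        together[np.ix_(rows, rows)] += same
        sampled[np.ix_(rows, rows)] += 1.0
    # Pairs never drawn together are left at 0.
    return np.divide(together, sampled, out=np.zeros((n, n)), where=sampled > 0)
```

The matrix `1 - consensus_matrix(X, k)` can then be passed to any distance-based clustering routine to obtain a final partition.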

Cabassi and Kirk (2020)⁶ proved that the consensus matrix is positive semi-definite and 
thus can be used as input to kernel learning integrative clustering via the application of the multiple 
kernel k-means algorithm. The kernel k-means algorithm utilizes the kernel trick, 
implicitly mapping the data into a non-linear feature space via a kernel. This overcomes the drawback 
of regular k-means clustering, which cannot identify clusters that are not linearly separable in 
the original input space. The integration of the multimodal data within kernel learning 
integrative clustering involves a convex sum of the kernels, i.e., the consensus matrices from each 
dataset, and the estimation of the weights in the convex sum. In the KLIC integrative clustering 
algorithm of Cabassi and Kirk (2020),⁶ the authors adopted the optimization strategy 
proposed by Gönen and Margolin (2014),¹⁰ which involves convex quadratic programming. 

In this paper, we reason that the weights in kernel learning integrative clustering can 
be estimated by utilizing the fuzziness in the consensus matrix. Furthermore, we extend the 
framework of KLIC by allowing multiple base clustering algorithms, e.g., k-means, hierarchical 
clustering and PAM, to be applied within each dataset and aggregated, i.e., ensemble clustering,¹¹ 
which has been shown to enhance the robustness of clustering results compared to an individual 
clustering algorithm. To this end, we propose an efficient weight estimation method and a two 
layer weighted integrative consensus clustering. 


2.1. Weight estimation 


The consensus matrix can be used to assess cluster stability and composition. As a motivating 
example, we generate two datasets, each with 10 features and 100 samples. For both datasets, 
we assume that there are 3 clusters with cluster sizes 20, 30 and 50. All the features are generated 
from the Gaussian distribution. For dataset 1, 9 out of the 10 features are informative, 
where the means of clusters 1, 2 and 3 are 1, -1 and 0, respectively, with unit variance. For 
dataset 2, 3 out of the 10 features are informative, where the means of clusters 1, 2 and 3 are 
0.2, -0.2 and 0, respectively, with unit variance. Non-informative features are generated from the 
standard Gaussian distribution. We designate datasets 1 and 2 as having high and low 
signal-to-noise ratio (SNR), respectively, and run consensus clustering on both datasets using 
100 iterations of k-means, resampling 80% of samples and features in each iteration. Figure 
1 shows the heatmaps of the consensus matrices. The diagonal blocks show the in-cluster values, 
whereas the off-diagonal blocks show the out-of-cluster values. For the low SNR dataset, 
the off-diagonal blocks are much noisier than for the high SNR dataset. We argue that 
this can be used to derive the weights in the multiple kernel integrative clustering. Specifically, 
we define the weights based on the ratio of in-cluster proportion to out-of-cluster proportion 
using the clusters estimated by the algorithm itself. A clustering result closer to the real structure 
tends to have a higher in-cluster proportion and a lower out-of-cluster proportion. In other 
words, datasets with a higher ratio of in-cluster proportion to out-of-cluster proportion will 
be assigned larger weights. 
Fig. 1. Heatmaps of consensus matrices for high and low signal-to-noise ratio (SNR) datasets. True 
cluster membership is given in the annotation above each heatmap. Predicted cluster membership 
corresponds to the three gap-separated blocks in each heatmap. 


Without loss of generality, we consider P consensus matrices $M_1, \ldots, M_P$ for number of 
clusters K. Here, the consensus matrices could arise by applying different clustering algorithms 
to the same dataset or could denote consensus matrices derived from different datasets. We 
further define: 

$W^{p}_{\mathrm{in}}(k)$: in-cluster proportion for cluster k of consensus matrix $M_p$. 

$W^{p}_{\mathrm{out}}(k)$: out-of-cluster proportion for cluster k of consensus matrix $M_p$. 

$\bar{W}^{p}_{\mathrm{in}}$: average in-cluster proportion across all clusters of consensus matrix $M_p$. 

$\bar{W}^{p}_{\mathrm{out}}$: average out-of-cluster proportion across all clusters of consensus matrix $M_p$. 

$R_p$: ratio of in-cluster proportion to out-of-cluster proportion for consensus matrix $M_p$. 

$W_p$: weight for consensus matrix $M_p$. 
We propose calculating the weights as follows: 
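
The displayed expressions are not reproduced in this text. One form that is consistent with the definitions above, and with the reported weights summing to one, is sketched below. The index sets $C_k$ (samples assigned to cluster k), the subscripts in/out, the inclusion of the diagonal in the in-cluster average, and the normalisation are assumptions made here, not necessarily the authors' exact formula.

```latex
\begin{align*}
W^{p}_{\mathrm{in}}(k) &= \frac{1}{|C_k|^{2}} \sum_{i \in C_k} \sum_{j \in C_k} M_p(i,j), &
W^{p}_{\mathrm{out}}(k) &= \frac{1}{|C_k|\,(N - |C_k|)} \sum_{i \in C_k} \sum_{j \notin C_k} M_p(i,j), \\
\bar{W}^{p}_{\mathrm{in}} &= \frac{1}{K} \sum_{k=1}^{K} W^{p}_{\mathrm{in}}(k), &
\bar{W}^{p}_{\mathrm{out}} &= \frac{1}{K} \sum_{k=1}^{K} W^{p}_{\mathrm{out}}(k), \\
R_p &= \frac{\bar{W}^{p}_{\mathrm{in}}}{\bar{W}^{p}_{\mathrm{out}}}, &
W_p &= \frac{R_p}{\sum_{q=1}^{P} R_q}.
\end{align*}
```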
In practice, true cluster membership is unknown, thus the weights will be computed based 
on predicted cluster membership. Using this formula, $W_1 = 0.726$ and $W_2 = 0.274$ for the 
consensus matrices derived based on predicted cluster membership of the two datasets above. 
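
The weight calculation can be illustrated with a short Python sketch. It assumes that the ratio for each consensus matrix is the average in-cluster proportion divided by the average out-of-cluster proportion, and that the weights are these ratios normalised to sum to one; both the normalisation and the function names are assumptions for illustration, not the intCC API.

```python
import numpy as np

def in_out_ratio(M, labels):
    """Average in-cluster proportion divided by average out-of-cluster proportion.

    Assumes at least two non-empty clusters; the in-cluster average
    includes the diagonal of M.
    """
    labels = np.asarray(labels)
    w_in, w_out = [], []
    for k in np.unique(labels):
        members = labels == k
        w_in.append(M[np.ix_(members, members)].mean())    # within cluster k
        w_out.append(M[np.ix_(members, ~members)].mean())  # cluster k vs the rest
    return np.mean(w_in) / max(np.mean(w_out), 1e-12)      # guard against division by 0

def consensus_weights(matrices, labelings):
    """Normalise the ratios so that the weights sum to one (assumed form)."""
    ratios = np.array([in_out_ratio(M, lab) for M, lab in zip(matrices, labelings)])
    return ratios / ratios.sum()

# Toy consensus matrices for four samples in two predicted clusters.
labels = np.array([0, 0, 1, 1])
clean = np.array([[1.0, 1.0, 0.1, 0.1],
                  [1.0, 1.0, 0.1, 0.1],
                  [0.1, 0.1, 1.0, 1.0],
                  [0.1, 0.1, 1.0, 1.0]])
noisy = np.array([[1.0, 0.6, 0.5, 0.5],
                  [0.6, 1.0, 0.5, 0.5],
                  [0.5, 0.5, 1.0, 0.6],
                  [0.5, 0.5, 0.6, 1.0]])
weights = consensus_weights([clean, noisy], [labels, labels])
```

In this toy case the cleaner matrix has ratio 1.0/0.1 = 10 and the noisier one 0.8/0.5 = 1.6, so the cleaner matrix receives the larger weight, mirroring the behaviour described for the high and low SNR datasets.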


2.2. Two Layer Weighted Integrative Consensus Clustering 


We now describe our proposed two layer weighted integrative consensus clustering. We assume 
that there are D datasets, $X_1, \ldots, X_D$, and number of clusters K. 


Layer 1: For each dataset $X_d$ where d = 1, 2, ..., D: 


(1) Perform ensemble clustering using P different clustering methods, where p = 1, 2, ..., P. 
This will generate consensus matrices $M^{d}_{p}$, where p = 1, 2, ..., P. 

(2) Compute the weights $w^{d}_{1}, w^{d}_{2}, \ldots, w^{d}_{P}$ for each consensus matrix $M^{d}_{1}, M^{d}_{2}, \ldots, M^{d}_{P}$. 

(3) Define the weighted consensus matrix $M^{d}_{\mathrm{weight}}$ as $M^{d}_{\mathrm{weight}} = \sum_{p=1}^{P} w^{d}_{p} \times M^{d}_{p}$. 

(4) Apply a clustering algorithm, e.g., PAM or hierarchical clustering, to each weighted 
consensus matrix $M^{d}_{\mathrm{weight}}$. 


Layer 2: 

(1) For the weighted consensus matrices $M^{1}_{\mathrm{weight}}, M^{2}_{\mathrm{weight}}, \ldots, M^{D}_{\mathrm{weight}}$, compute the weights 
$W_1, W_2, \ldots, W_D$. 

(2) Define the weighted of weighted consensus matrix $M_{\mathrm{weight}}$ as $M_{\mathrm{weight}} = \sum_{d=1}^{D} W_d \times M^{d}_{\mathrm{weight}}$. 

(3) Apply a clustering algorithm, e.g., PAM or hierarchical clustering, to $M_{\mathrm{weight}}$ to derive a 
final clustering result. 
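
The two layers can be sketched end to end in Python. This is an illustrative reading of the steps above, not the intCC R package: the ratio-based weights use the assumed normalised form, the predicted membership for each consensus matrix is obtained by clustering 1 − M, and a simple alternating k-medoids routine stands in for PAM. All function names are hypothetical.

```python
import numpy as np

def in_out_ratio(M, labels):
    """Mean in-cluster over mean out-of-cluster proportion (assumed form)."""
    w_in, w_out = [], []
    for k in np.unique(labels):
        members = labels == k
        w_in.append(M[np.ix_(members, members)].mean())
        w_out.append(M[np.ix_(members, ~members)].mean())
    return np.mean(w_in) / max(np.mean(w_out), 1e-12)

def weighted_sum(matrices, labelings):
    """Convex combination of consensus matrices with ratio-based weights."""
    ratios = np.array([in_out_ratio(M, lab) for M, lab in zip(matrices, labelings)])
    weights = ratios / ratios.sum()
    return sum(w * M for w, M in zip(weights, matrices)), weights

def k_medoids(D, k, n_iter=50):
    """Simplified k-medoids on a distance matrix D (a stand-in for PAM)."""
    medoids = [int(D.sum(axis=1).argmin())]
    while len(medoids) < k:                       # farthest-point initialisation
        medoids.append(int(D[:, medoids].min(axis=1).argmax()))
    medoids = np.array(medoids)
    for _ in range(n_iter):
        labels = D[:, medoids].argmin(axis=1)     # assign to nearest medoid
        new = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:                      # most central member becomes medoid
                new[c] = members[D[np.ix_(members, members)].sum(axis=0).argmin()]
        if np.array_equal(new, medoids):
            break
        medoids = new
    return D[:, medoids].argmin(axis=1)

def two_layer(consensus, k):
    """consensus[d][p] is the consensus matrix of algorithm p on dataset d."""
    layer1 = []
    for mats in consensus:                                   # Layer 1, per dataset
        labs = [k_medoids(1 - M, k) for M in mats]           # predicted membership
        M_d, _ = weighted_sum(mats, labs)                    # weighted consensus matrix
        layer1.append(M_d)
    labs = [k_medoids(1 - M, k) for M in layer1]             # Layer 2
    M_final, dataset_weights = weighted_sum(layer1, labs)
    return k_medoids(1 - M_final, k), dataset_weights
```

The nested list `consensus` keeps the dataset and algorithm dimensions separate, which is what distinguishes the two layer scheme from a single weight estimation over all D × P matrices at once.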


We provide a flowchart in Figure 2 summarizing our proposed two layer weighted integra- 
tive consensus clustering. Our method is implemented as a GitHub R package intCC available 
at https://github.com/candsj/intCC. 


3. Simulation studies 


We conduct simulation studies to compare the performance of our proposed two layer weighted 
integrative consensus clustering intCC against other integrative clustering methods which are 
implemented for both Gaussian and non-Gaussian distributed datasets, namely KLIC⁶ and 
iClusterPlus.³ 
[Figure 2: flowchart. Each dataset undergoes ensemble clustering with different clustering 
algorithms; the resulting consensus matrices are combined into a weighted consensus matrix per 
dataset, and clustering is performed based on each of these; the per-dataset matrices are then 
combined into the weighted of weighted consensus matrix, which is clustered to give the final 
integrative cluster assignment for each sample.] 


Fig. 2. Flowchart describing our proposed algorithm. 


3.1. Datasets 


Unlike Cabassi and Kirk (2020), who only considered data simulated from Gaussian distributions, 
we follow the strategy of Mo et al. (2013)³ and generate datasets from different 
distributions, including Gaussian (e.g., M-values from DNA methylation, microarray data 
such as gene expression), binomial (e.g., somatic mutations), Poisson (e.g., count data from 
sequencing technologies such as RNA-Seq data or copy number data represented as number 
of copies gained or lost) and multinomial (e.g., copy number data states represented as gain, 
normal or loss, or SNP data) distributions. This is to ensure that our proposed method is 
applicable to integration of continuous, binary, count and categorical types of datasets. For 
Settings 1-6, we set the sample size and the true number of clusters to be 60 and 3, respectively, 
with each cluster consisting of 20 samples. We vary the number of informative and 
non-informative, i.e., noise, features. The parameters used in our simulations for Settings 1-6 
are provided in Supplementary Table 1. Settings 7-9 follow the simulation setup of 
Cabassi and Kirk (2020).⁶ We consider several simulation settings, namely: 


(1) Setting 1: 4 datasets following Gaussian, binomial, Poisson and multinomial distribution, 
respectively. Each dataset has 30 features, in which 15 features are informative and the 
rest are noise features. 

(2) Setting 2: 4 datasets following Gaussian, binomial, Poisson and multinomial distribution, 
respectively. Each dataset has 30 features, in which 15 features are informative and the 
rest are noise features. Informative features have slightly lower signal compared to 
Setting 1. 

(3) Setting 3: 3 datasets following Gaussian, binomial and Poisson distribution, respectively. 
Each dataset has 30 features, in which 15 features are informative and the rest are noise 
features. 

(4) Setting 4: 5 datasets following Gaussian, binomial, Poisson, multinomial and Gaussian 
distribution, respectively. Each dataset has 30 features. For the first 4 datasets, 15 features 
are informative and the rest are noise features. The 5th dataset follows a Gaussian 
distribution in which all features are noise features. 

(5) Setting 5: 4 datasets following Gaussian, binomial, Poisson and multinomial distribution, 
respectively. Each dataset has 500 features, in which 100 features are informative and the 
rest are noise features. 

(6) Setting 6: 4 datasets following Gaussian, binomial, Poisson and multinomial distribution, 
respectively. Each dataset has 500 features, in which 250 features are informative and the 
rest are noise features. 

(7) Setting 7: 4 datasets following Gaussian distribution with similar parameter setting. Each 
dataset consists of 300 samples with 6 clusters of size 50 samples each. There are 2 features 
with no noise feature. For cluster k, μ = k × (separation level − 1)/2, σ = 1, k = 1, 2, 3, 4, 5, 6. 
Separation level = 4 is used in this setting. 

(8) Setting 8: 4 datasets following Gaussian distribution with different parameter setting. Each 
dataset consists of 300 samples with 6 clusters of size 50 samples each. There are 2 features 
with no noise feature. For cluster k, μ = k × (separation level − 1)/2, σ = 1, k = 1, 2, 3, 4, 5, 6. 
Varying separation levels = 1, 2, 3, 4 are used in this setting. Only 3 datasets are used as 
input. We consider 4 dataset combinations, namely 123, 124, 134, 234. Here 123 implies 
that the clustering algorithms are applied to only datasets 1, 2 and 3. 

(9) Setting 9 (nested cluster structure): 2 datasets following Gaussian distribution, in which 
each dataset consists of 300 samples. There are 2 features with no noise feature. Dataset 
1 has 6 clusters of size 50 samples each. Dataset 2 has 3 clusters of size 100 samples each. 
For cluster k, μ = k × (separation level − 1)/2, σ = 1, k = 1, 2, 3, 4, 5, 6 for dataset 1 and 
k = 1, 2, 3 for dataset 2. Separation level = 4 is used in this setting. 


Each setting is repeated 100 times. Additional simulation settings including multivariate Gaus- 
sian distribution are provided in Supplementary Material. 
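
As an illustration of the Gaussian settings (Settings 7 and 8), the following Python sketch draws one dataset with cluster means μ = k × (separation level − 1)/2 and unit standard deviation. It assumes that every feature of a sample shares the same mean; the function name and the seed handling are choices made for this example only.

```python
import numpy as np

def simulate_gaussian_dataset(separation_level, n_clusters=6, cluster_size=50,
                              n_features=2, seed=0):
    """Draw one dataset in the style of Settings 7-8.

    Samples in cluster k (k = 1, ..., n_clusters) have mean
    k * (separation_level - 1) / 2 in every feature and sigma = 1.
    """
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(1, n_clusters + 1), cluster_size)
    means = labels * (separation_level - 1) / 2.0
    X = rng.normal(loc=means[:, None], scale=1.0, size=(labels.size, n_features))
    return X, labels

# Setting 8 style: four datasets with separation levels 1 to 4 (level 1 carries no signal).
datasets = [simulate_gaussian_dataset(level, seed=level)[0] for level in (1, 2, 3, 4)]
```

With separation level 1 all cluster means coincide at 0, so such a dataset is pure noise, which is what makes the dataset combinations in Setting 8 informative about the estimated weights.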


3.2. Clustering algorithms 


We apply several clustering strategies based on our proposed method intCC, KLIC⁶ and 
iClusterPlus.³ To evaluate the advantage of ensemble clustering, i.e., applying multiple clustering 
algorithms to each dataset, we also include a version of our proposed method which runs only a 
single clustering algorithm on each dataset. We denote this as one layer weighted integrative consensus 
clustering. We also compare application of PAM and hierarchical clustering to the weighted 
consensus matrix in deriving a final clustering result. These methods are denoted as: 


(1) iClusterPlus: applying iClusterPlus with the data type specified. 

(2) KLIC-k-means: KLIC by applying k-means to each dataset for generating the consensus 
matrix. 

(3) KLIC-Hclust: KLIC by applying hierarchical clustering to each dataset for generating the 
consensus matrix. 

(4) 1 layer intCC-k-means (PAM): One layer weighted integrative consensus clustering by 
applying k-means to each dataset for generating the consensus matrix, followed by PAM 
to derive a final clustering result. 

(5) 1 layer intCC-Hclust (PAM): One layer weighted integrative consensus clustering by ap- 
plying hierarchical clustering to each dataset for generating the consensus matrix, followed 
by PAM to derive a final clustering result. 

(6) 1 layer intCC-k-means (Hclust): One layer weighted integrative consensus clustering by 
applying k-means to each dataset for generating the consensus matrix, followed by hier- 
archical clustering to derive a final clustering result. 

(7) 1 layer intCC-Hclust (Hclust): One layer weighted integrative consensus clustering by ap- 
plying hierarchical clustering to each dataset for generating the consensus matrix, followed 
by hierarchical clustering to derive a final clustering result. 


To obtain an unbiased comparison to our two layer approach, we also apply KLIC with 
multiple clustering algorithms. In other words, suppose there are 4 datasets and two clustering 
algorithms are applied to each dataset, there will be a total of 8 consensus matrices, i.e., akin 
to applying KLIC to 8 datasets. KLIC is applied using these 8 consensus matrices as input 
in the multiple kernel integrative clustering. Additionally, to illustrate the advantage of two 
layer approach, we also include another one layer approach in which we apply a single layer 
weight estimation to the 8 consensus matrices. These methods are denoted as: 


(8) 2 layer intCC-2 methods (PAM): Two layer weighted integrative consensus clustering 
by applying both k-means and hierarchical clustering to each dataset for generating the 
consensus matrices, followed by PAM to derive a final clustering result. 

(9) 2 layer intCC-2 methods (Hclust): Two layer weighted integrative consensus clustering 
by applying both k-means and hierarchical clustering to each dataset for generating the 
consensus matrices, followed by hierarchical clustering to derive a final clustering result. 

(10) KLIC-2-methods: KLIC by applying both k-means and hierarchical clustering to each 
dataset for generating the consensus matrices. 

(11) 1 layer intCC-2 methods (PAM): One layer weighted integrative consensus clustering 
by applying both k-means and hierarchical clustering to each dataset for generating the 
consensus matrices, followed by PAM to derive a final clustering result. 

(12) 1 layer intCC-2 methods (Hclust): One layer weighted integrative consensus clustering 
by applying both k-means and hierarchical clustering to each dataset for generating the 
consensus matrices, followed by hierarchical clustering to derive a final clustering result. 


For Settings 1-8, we apply each method by setting the number of clusters to be the true 
number of clusters. In practice, one can tune the optimal number of clusters using criteria 
such as the silhouette method,¹² gap statistics,¹³ Dunn index¹⁴ or the delta K method.⁷ For 
Setting 9, we consider (a) global clustering, where we set the number of clusters to be the 
same throughout for both individual datasets and the final integrative clustering, i.e., either 3 or 6 
throughout (we denote these strategies as “Global K=3” and “Global K=6”), and (b) separate 
clustering, where we use the true number of clusters for each individual dataset, i.e., 6 for dataset 
1 and 3 for dataset 2, and consider both K = 3 and K = 6 in the final integrative clustering 
(we denote these strategies as “Separate K=3” and “Separate K=6”). Additionally, due to the 
poor performance of iClusterPlus and its long computational time, we omit iClusterPlus for 
Settings 4-6. We compare the performance of the clustering methods via the average adjusted 
Rand index (ARI). We also report the weight estimation time of intCC and KLIC. 
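
For readers unfamiliar with the ARI, a minimal Python sketch of the standard contingency-table formula is given below. The function name is chosen for this example, and the value returned for the degenerate case where the index is undefined is a convention assumed here.

```python
import numpy as np

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two partitions of the same samples."""
    _, t = np.unique(np.asarray(labels_true).ravel(), return_inverse=True)
    _, p = np.unique(np.asarray(labels_pred).ravel(), return_inverse=True)
    table = np.zeros((t.max() + 1, p.max() + 1), dtype=np.int64)
    np.add.at(table, (t, p), 1)                    # contingency table of the two partitions

    def pairs(x):
        return x * (x - 1) / 2.0                   # number of unordered pairs

    sum_cells = pairs(table).sum()
    sum_rows = pairs(table.sum(axis=1)).sum()
    sum_cols = pairs(table.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / pairs(len(t)) # agreement expected by chance
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:                      # degenerate case, e.g. one cluster each
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

The ARI equals 1 when the two partitions agree up to a relabelling of clusters, is close to 0 for random assignments, and can be negative when agreement is worse than chance, which is why it is preferred over raw accuracy for comparing clusterings.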


3.3. Results 


We summarize the ARI for each simulation setting in Figure 3. Overall, the results show that 
our proposed methods, namely 2 layer intCC-2 methods (PAM) and 1 layer intCC-k-means 
(PAM), perform well across all simulation settings. To explain this observation, we summarize 
the ARI within each simulated dataset of Setting 4, taken as a representative example, in Figures 
4A and 4B. The ARI obtained with k-means as the base algorithm in the consensus clustering 
within each dataset is significantly better than that obtained with hierarchical clustering in the 
simulated datasets considered in this paper. Thus, it is not surprising that methods which use 
k-means as the base clustering algorithm in the consensus clustering yield better performance. 
However, in practice the best base clustering algorithm is not always known in advance. The 
2 layer intCC, which aggregates multiple base clustering algorithms, automatically assigns 
higher weights to the better algorithm in our simulation studies, as evident from the estimated 
weights in Figures 4C and 4D. It is also worth noting that our method assigns significantly 
smaller weights to the 5th dataset, in which all the features are noise features. Additionally, 
using PAM to derive a final clustering result in general yields better performance compared 
to hierarchical clustering. We also note that the performance of iClusterPlus is significantly 
poorer compared to other methods, consistent with the findings of Cabassi and Kirk (2020).⁸ 
Moreover, extending KLIC to run multiple base clustering algorithms, i.e., KLIC-2-methods, 
yields lower ARI compared to our proposed method, implying that the current KLIC framework 
does not extend straightforwardly to incorporate ensemble clustering. 

Taking Setting 1 (60 samples) and Setting 7 (300 samples) as representative examples, we also 
report the weight calculation time for KLIC and our proposed method intCC in Table 1. The 
intCC weights are computed in roughly 1 to 10 milliseconds, compared with roughly 0.5 to 28 
seconds for KLIC, which shows that our proposed weight calculation is computationally 
efficient and yields good operating characteristics. 


4. Case study 


We illustrate our proposed method intCC on the TCGA pan cancer datasets.¹⁰ There are 5 
datasets across 12 cancer types which represent different tissues of origin, including DNA copy 
number, DNA methylation, mRNA expression, microRNA expression and protein expression 
data. To minimize bias in the comparison, we use the same preprocessing pipeline as previously 
described.⁸,¹⁰ 
Fig. 4. A-B. Distribution of ARI within each simulated dataset of Setting 4. C-D. Distribution of 
estimated weights from intCC within each simulated dataset of Setting 4. Purple (green) boxplots 
are results by applying k-means (hierarchical clustering) algorithm in the consensus clustering. A, 
C. Using PAM to derive a final clustering result. B, D. Using hierarchical clustering to derive a final 
clustering result. 


Table 1. Weight calculation time comparison. 


Method                              Setting 1 (seconds)   Setting 7 (seconds) 
KLIC-k-means                        0.541                 7.209 
KLIC-Hclust                         0.791                 8.241 
1 layer intCC-k-means (PAM)         0.000879              0.00330 
1 layer intCC-Hclust (PAM)          0.000882              0.00335 
1 layer intCC-k-means (Hclust)      0.000876              0.00333 
1 layer intCC-Hclust (Hclust)       0.000909              0.00332 
2 layer intCC-2 methods (PAM)       0.00273               0.0103 
2 layer intCC-2 methods (Hclust)    0.00265               0.0102 
KLIC-2-methods                      2.428                 27.592 
1 layer intCC-2 methods (PAM)       0.00155               0.00673 
1 layer intCC-2 methods (Hclust)    0.00152               0.00686 
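
To make the size of the gap in Table 1 concrete, the short Python sketch below divides each KLIC
timing by the timing of an intCC variant. The timings are copied from Table 1; pairing each KLIC
variant with the 1 layer intCC variant that uses the same base algorithm(s) is an assumption made
here for illustration only.

```python
# Weight calculation times in seconds from Table 1, as (Setting 1, Setting 7).
TIMES = {
    "KLIC-k-means": (0.541, 7.209),
    "KLIC-Hclust": (0.791, 8.241),
    "KLIC-2-methods": (2.428, 27.592),
    "1 layer intCC-k-means (PAM)": (0.000879, 0.00330),
    "1 layer intCC-Hclust (PAM)": (0.000882, 0.00335),
    "1 layer intCC-2 methods (PAM)": (0.00155, 0.00673),
}

# Assumed pairing: same base clustering algorithm(s) on both sides.
PAIRS = [
    ("KLIC-k-means", "1 layer intCC-k-means (PAM)"),
    ("KLIC-Hclust", "1 layer intCC-Hclust (PAM)"),
    ("KLIC-2-methods", "1 layer intCC-2 methods (PAM)"),
]


def speedups(times, pairs):
    """Ratio of KLIC time to intCC time for each pair and each setting."""
    return {
        (klic, intcc): tuple(k / i for k, i in zip(times[klic], times[intcc]))
        for klic, intcc in pairs
    }


ratios = speedups(TIMES, PAIRS)
for (klic, intcc), (s1, s7) in ratios.items():
    print(f"{klic} vs {intcc}: {s1:.0f}x (Setting 1), {s7:.0f}x (Setting 7)")
```

Under this pairing every ratio is in the hundreds or thousands, and the ratio grows from
Setting 1 to Setting 7.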


Cabassi and Kirk (2020)⁸ followed the same procedures described in Hoadley et al. (2014)¹⁰ 
in setting the number of clusters for each dataset, except for microRNA expression, for which 
the authors identified 8 as the number of clusters. We also set the number of clusters for each 
dataset following Cabassi and Kirk (2020). Subsequently, we apply our proposed method 
intCC to obtain an integrative clustering across these datasets using the PAM algorithm to 
derive a final clustering result. Our method also selects 10 as the optimal number of clusters 
based on the average silhouette criterion, similar to KLIC.’ Figure 5A compares the cluster 
membership of our method intCC against the results of KLIC, with ARI 0.693, whereas 
Figures 5B and 5C compare the cluster membership of intCC and KLIC against the 12 cancer 
type annotation, respectively. The ARI between intCC and cancer type annotation associated 
with tissues of origin is 0.754, whereas the ARI between KLIC and cancer type annotation is 
0.585, indicating that the cluster membership of intCC yields a higher consistency with tissues 
of origin in the TCGA pan cancer datasets. Further investigation into the clusters obtained 
by intCC versus KLIC within the breast invasive carcinoma (BRCA) subset indicates that 
the results from intCC yield a higher consistency with the TCGA-BRCA molecular subtypes 
compared to the results from KLIC (Supplementary Material). 

The estimated weights of each dataset for intCC and KLIC are (DNA copy number, DNA 
methylation, mRNA expression, miRNA expression, protein expression) = (0.073, 0.401, 0.045, 
0.272, 0.209) and (0.309, 0.192, 0.168, 0.183, 0.148), respectively. intCC assigns its highest weight 
to the DNA methylation data, whereas KLIC assigns its highest weight to the copy number data, 
which could explain the differences observed in cluster memberships obtained by these two 
methods. Finally, the weight calculation time for intCC is 0.43 seconds, whereas the weight 
calculation time for KLIC via quadratic programming is more than 10 hours on an Intel(R) Xeon(R) 
CPU E5-1650 v3 @ 3.50GHz. 
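
The role of these weights can be sketched in a few lines of Python. The weight vectors below are
the ones reported above; the linear combination of matrices, the helper names and the two toy
3 x 3 consensus matrices are assumptions made for illustration and may differ from how intCC or
KLIC combine their kernels internally.

```python
DATASETS = ["DNA copy number", "DNA methylation", "mRNA expression",
            "miRNA expression", "protein expression"]
INTCC_WEIGHTS = [0.073, 0.401, 0.045, 0.272, 0.209]   # reported in the text
KLIC_WEIGHTS = [0.309, 0.192, 0.168, 0.183, 0.148]    # reported in the text


def rank_by_weight(names, weights):
    """Datasets ordered from most to least influential."""
    return sorted(zip(names, weights), key=lambda pair: pair[1], reverse=True)


def weighted_combination(matrices, weights):
    """Element-wise weighted sum of equally sized square matrices (assumed form)."""
    n = len(matrices[0])
    combined = [[0.0] * n for _ in range(n)]
    for matrix, weight in zip(matrices, weights):
        for i in range(n):
            for j in range(n):
                combined[i][j] += weight * matrix[i][j]
    return combined


intcc_top = rank_by_weight(DATASETS, INTCC_WEIGHTS)[0][0]
klic_top = rank_by_weight(DATASETS, KLIC_WEIGHTS)[0][0]

# Hypothetical consensus matrices for two datasets and three samples.
toy_a = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]]
toy_b = [[1.0, 0.4, 0.6], [0.4, 1.0, 0.5], [0.6, 0.5, 1.0]]
toy_combined = weighted_combination([toy_a, toy_b], [0.7, 0.3])
```

Because each reported weight vector sums to one, a combination of this form keeps the unit
diagonal of the consensus matrices, and a dataset with a small weight contributes little to the
final clustering.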
Fig. 5. Heatmaps of coincidence matrices comparing A. intCC clusters to KLIC clusters, B. intCC 
clusters to cancer type annotation, C. KLIC clusters to cancer type annotation. The ARI is reported 
in the header of each plot. 


5. Discussion 


The rapid development of high throughput technologies has given scientists an avenue 
to decipher complex human diseases from a systems biology perspective via multiomics 
profiling. Integrative clustering has become a powerful approach to dissect the heterogeneity 
underpinning these diseases, e.g., to define new cancer subtypes which may help inform 
treatment efforts. In this paper, we extend the framework of KLIC,⁸ which recasts the integrative 
clustering model into a multiple kernel learning framework by utilizing the consensus matrices 
estimated from consensus clustering as input. Specifically, our model further incorporates 
ensemble learning via an aggregation of multiple base clustering algorithms to enhance the 
robustness of the multiple kernel integrative clustering model. This safeguards against 
applying a single base clustering algorithm that performs poorly on the dataset. Additionally, 
we propose an efficient weight estimation to combine the consensus matrices. Our 
simulation studies show that the proposed two layer weighted integrative clustering yields better 
performance overall. 

Conceptually, the weight estimation is analogous to the heuristics of multiple kernel 
support vector machine (MKL-SVM) based on kernel-target alignment.'©'® Specifically, 
MKL-SVM is developed for supervised learning and the kernel-target alignment depends on the true 
binary class labels. For a fixed cluster membership, this is equivalent to multi-class 
classification. One can extend the kernel-target alignment for multi-class classification by dividing 
the problem into several binary classification subproblems (e.g., one-versus-all or all-pairs). 
However, how to optimally combine the results across these binary subproblems is not trivial 
and may require longer computational time compared to our proposed method. 
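
For reference, the classical binary kernel-target alignment mentioned above can be sketched as
follows: the alignment between a kernel matrix K and labels y in {-1, +1} is the Frobenius inner
product of K with the ideal kernel yy^T, divided by the product of their Frobenius norms. This is
the textbook definition, not the intCC weight estimator, and the toy inputs are hypothetical.

```python
from math import sqrt


def frobenius_inner(a, b):
    """Sum of element-wise products of two equally sized matrices."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def kernel_target_alignment(kernel, labels):
    """Alignment between a kernel matrix and binary labels in {-1, +1}."""
    target = [[yi * yj for yj in labels] for yi in labels]   # ideal kernel y y^T
    numerator = frobenius_inner(kernel, target)
    denominator = sqrt(frobenius_inner(kernel, kernel) * frobenius_inner(target, target))
    return numerator / denominator


labels = [1, 1, -1, -1]
ideal_kernel = [[yi * yj for yj in labels] for yi in labels]
identity_kernel = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

ideal = kernel_target_alignment(ideal_kernel, labels)
uninformative = kernel_target_alignment(identity_kernel, labels)
```

A kernel that reproduces the class structure exactly reaches the maximum alignment of 1, whereas
an identity kernel, which treats every sample as similar only to itself, scores lower.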

Besides identifying appropriate and robust clustering algorithms, another important research 
question in unsupervised learning is tuning the optimal number of clusters. Several 
metrics have been proposed for this task, including the silhouette method,!? gap statistics,'® 
Dunn index" and the delta K method.’ An immediate extension to our intCC framework is 
to aggregate the different metrics/criteria for selecting the optimal number of clusters. 
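
As a concrete example of one such criterion, the sketch below computes the average silhouette
width for two candidate partitions of the same points and keeps the partition with the larger
value. It is a plain-Python illustration using Euclidean distance; the points and the candidate
labelings are hypothetical, and in practice each labeling would come from running the clustering
method with a different number of clusters.

```python
from math import dist


def average_silhouette(points, labels):
    """Mean silhouette width of a partition, using Euclidean distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)      # usual convention for singleton clusters
            continue
        # Mean distance to the other members of the point's own cluster.
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # Smallest mean distance to the members of any other cluster.
        b = min(sum(dist(point, q) for q in other) / len(other)
                for key, other in clusters.items() if key != label)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
candidates = {
    2: [0, 0, 1, 1, 1, 1],
    3: [0, 0, 1, 1, 2, 2],
}
silhouettes = {k: average_silhouette(points, labels) for k, labels in candidates.items()}
best_k = max(silhouettes, key=silhouettes.get)
```

Aggregating several criteria, as suggested above, would replace the single `max` call with a rule
that combines the values returned by each criterion.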


Supplementary Material and Code 


Supplementary Material is available online at 
http://www.ams.sunysb.edu/~pfkuan/PDF/SM_PSB2024.pdf. 
The R code implementing intCC is available online at https://github.com/candsj/intCCc. 
Large Language Models (LLMs) are a type of artificial intelligence that has been revolutionizing 
various fields, including biomedicine. They have the capability to process and analyze large amounts 
of data, understand natural language, and generate new content, making them highly desirable in 
many biomedical applications and beyond. In this workshop, we aim to give attendees an 
in-depth understanding of the rise of LLMs in biomedicine and of how they are being used to drive 
innovation and improve outcomes in the field, along with the associated challenges and pitfalls. 


Keywords: ChatGPT; large language model; LLM; generative AI; biomedicine and health; 
education; ethics. 


† Work partially supported by NIH award U24 HG007822. 


1. Background 


A language model (LM) is a machine learning technique for natural language processing tasks. 
LMs typically predict the probability of a word appearing in a text sequence based on the preceding 
words, modeling linguistic intuition (like completing a missing word in a sentence). One of the key 
advances in LMs was the introduction of the transformer architecture [1], which became the 
cornerstone for many of the large language models (LLMs) that followed. In brief, the transformer 
architecture includes two modules, namely, an encoder of bidirectional attention blocks and a 
decoder of unidirectional attention blocks. Based on which modules are used, LLMs are 
classified as encoder-only (e.g., BERT, Bidirectional Encoder Representations from Transformers 
[2]), encoder-decoder (e.g., T5, Text-to-Text Transfer Transformer [3]), or decoder-only (e.g., the 
GPT, Generative Pre-trained Transformer, series [4]-[6]). Models in the latter class can have billions 
(or even trillions) of parameters and are trained on massive amounts of unlabeled text, providing the 
ability to generate human-like text [7]. In addition to capturing the language, these models can 
“memorize” facts during training. Thus, LLMs have the capacity to efficiently handle and analyze 
extensive text data and generate fresh content, demonstrating significant promise in diverse 
applications. 
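
The next-word prediction idea described above can be illustrated with the simplest possible
language model, a bigram model that conditions on only the single previous word. The Python
sketch below is a toy under that assumption: the corpus and function names are hypothetical, and
transformer-based LLMs condition on much longer contexts with learned parameters rather than
raw counts.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for previous, current in zip(tokens, tokens[1:]):
        counts[previous][current] += 1
    return counts


def next_word_probs(counts, previous):
    """Estimated probability of each next word given the previous word."""
    followers = counts.get(previous, Counter())
    total = sum(followers.values())
    return {word: count / total for word, count in followers.items()} if total else {}


corpus = ("the patient has a fever . the patient has a cough . "
          "the doctor has a plan .").split()
model = train_bigram(corpus)
probs_after_the = next_word_probs(model, "the")
```

In this toy corpus "the" is followed by "patient" twice and by "doctor" once, so the model
assigns them probabilities of 2/3 and 1/3, and a word it has never seen yields no prediction.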


The launch of ChatGPT, the LLM-based chatbot developed by OpenAI [8], to the public in late 
2022 has sparked a number of exciting opportunities, but also some challenges and ethical 
concerns. It was recently reported that a keyword search for “large language models” OR 
“ChatGPT” in PubMed returned 582 articles by the end of May 2023 [9]. The same search 
conducted at the end of September 2023 returned 1,495 articles, more than double that number 
within four months. Publications include research and review articles as well as relevant 
commentaries on how LLMs are reshaping biomedicine, healthcare and education [9]-[16]. The 
extent of LLM applications goes beyond language, with active research in the fields of protein 
annotation [17], [18], function [19], and structure prediction [20]. While LLMs offer substantial 
benefits, it is important to acknowledge their limitations, such as hallucinations, and key ethical 
challenges, including the perpetuation of biases present in the training data, which calls for 
efforts to ensure fairness and equity in their applications; privacy issues when handling 
sensitive data; and transparency and plagiarism, among others. 


2. LLM and ChatGPT in Biomedicine Workshop 


Given the rapid evolution and dissemination of LLMs, and more specifically ChatGPT, the 
proposed workshop aims to introduce and discuss the latest developments surrounding this new 
technology in biomedicine during its first year. The workshop will consist of talks spanning the 
following topics: 


• Introduction to LLM Technology: This talk will provide an overview of LLMs, including their 
architecture, training process, and how they work. It will help attendees understand the basics 
of this technology and why it is relevant to biomedicine. 


• Use of Standard LLMs in Scientific Research: This talk will focus on the use of standard 
LLMs in research, and how they can support researchers in various ways, including helping 
design and analyze experiments, writing code, brainstorming, and writing papers. 


• Use of LLMs in Education and Academic Writing: This talk will discuss the use of LLMs 
in the classroom, both for teachers and students. It will highlight the benefits of using LLMs in 
teaching and learning and provide examples (e.g., as a writing assistant) of how they are being 
used to enhance the educational experience. 


• Applications of LLMs in Healthcare: This talk will showcase the use of LLM technologies in 
custom methods development for existing/new problems in clinical informatics research and 
healthcare. 


• Ethics of Using LLMs: This talk will feature topics surrounding the use of such chatbot 
technology in medical care and scientific research, including but not limited to privacy and 
ethical concerns, AI bias, and legal liabilities. 


3. Conclusion 


We envision that, similarly to previous technological disruptions in society (e.g., the calculator, 
the computer or the internet), LLMs will become an integral part of our lives, and the discussions 
in this workshop will help to shape the landscape ahead. 
Workshop Description 


In biomedical research and clinical medicine, many of the ethical frameworks and processes focus on benefits 
and harms at the individual level. However, in biomedicine, there is increasing recognition of a need to 
implement frameworks and processes that address the social impacts of technologies, such as genomics and AI 
technologies, and their social benefit for underrepresented populations and communities. For example, studies 
demonstrating the potential for bias in AI shed light on the need to develop processes to more effectively 
identify and address downstream impacts of medical AI, as well as engage communities who are stakeholders 
in the research. Privacy is often envisioned as an individual right, but the collection and use of data also have 
repercussions at the level of groups and communities. For that reason, there have been recent efforts to arrive at 
models for data stewardship and data sovereignty. This workshop will provide a forum for discussion of 
practical approaches to enhancing fairness, social responsibility and inclusion of diverse viewpoints in 
biomedicine. Interdisciplinary research on ethics and how fairness, social responsibility, and community 
engagement can be operationalized in biomedical research will provide a foundation for robust discussion on 
these issues. 


The 3-hour workshop will consist of two parts: 

• The first part will include a series of 15-minute talks that address fairness, social responsibility & 
inclusion/community engagement for different areas of biomedicine, followed by an audience Q&A 
and discussion of topics such as diversity in precision medicine, ethical and sustainable data 
stewardship, and public engagement with social and behavioral genomics. 

• For the second half of the workshop, we will conduct an interactive exercise with the audience. 
Focusing on case studies based on topics from the first half of the workshop, such as community 
engagement and data stewardship, we will use smartphone-based polling to facilitate feedback from the 
audience on approaches, challenges and solutions for addressing the ethical issues from the case study. 


Learning Objectives 


By the end of this workshop attendees will be able to: 


© 2023 The Authors. Open Access chapter published by World Scientific Publishing Company and distributed 
under the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC) 4.0 License. 
1. Understand the social and political context that underlies the need for frameworks and processes that 
more effectively address the impacts of these technologies on individuals and communities. 

2. Explore and analyze efforts to identify and address the downstream harms and benefits of biomedical 
technologies. 

3. Locate actors that have the ability to mitigate the downstream harms of biomedical technologies and/or 
the ability to promote their downstream benefits. 


Presenter Information 


This workshop brings together rich and interdisciplinary perspectives from medical anthropology, biomedical 
engineering, education, and bioethics, as well as legal perspectives. Importantly, our multidisciplinary and 
multi-institution workshop aims to do more than provide the PSB community with the opportunity to come 
together to analyze and evaluate efforts to enhance social responsibility and the inclusion of diverse viewpoints 
in biomedicine. We offer workshop attendees strategies for intervening to assist with promoting fairness, social 
responsibility, inclusion, and justice in biomedical research and practice. 


About the Workshop Organizers 


Daphne Martschenko, Ph.D., is an Assistant Professor at the Stanford University Center for Biomedical 
Ethics and a co-organizer of the international Race, Empire, and Education Research Collective. Dr. 
Martschenko holds an MPhil from the University of Cambridge in Politics, Development, and Democratic 
Education and in 2019 received a Ph.D. in Education, also from the University of Cambridge. Dr. 
Martschenko’s work advocates for and facilitates the ethical and responsible conduct of and public engagement 
with genetic/genomic research. 


Nicole Martinez-Martin, JD, Ph.D., is an Assistant Professor at the Stanford Center for Biomedical Ethics. 
She received her JD from Harvard Law School and her doctorate in social sciences (comparative 
development/medical anthropology) from the University of Chicago. Her broader research interests concern the 
impact of new technologies on the treatment of vulnerable populations. Her recent work in bioethics and 
neuroethics has focused on the ethics of AI and digital health technology, such as digital phenotyping or 
computer vision, for medical and behavioral applications. 


Meghan Halley, PhD, MPH, is a Senior Research Scholar in the Stanford Center for Biomedical Ethics 
(SCBE) at Stanford University. She completed her doctorate in medical anthropology at Case Western 
Reserve University in 2012, and additional training in health services research at the Palo Alto Medical 
Foundation Research Institute from 2012 through 2016. Her current research focuses on the intersection of the 
ethics and economics of new genomic technologies. Her current projects include examining ethical issues 
related to sustainability and governance of patient data and relationships when large clinical genomic studies 
transition to new models of funding; ethnographic work exploring how diverse stakeholders perceive value in 
the use of genome sequencing for diagnosis of rare diseases; and the development of new measures for 
assessing patient-centered outcomes in pediatric rare diseases. 


Presentations 


• Daphne Martschenko, PhD, Assistant Professor in Biomedical Ethics, Stanford University: “Wrestling 
with Public Input on Social and Behavioral Genomics” reporting on scholarship gathering the 
perspectives of members of the public on the risks and potential benefits of social and behavioral 
genomics. 

• Mildred Cho, PhD, Professor in Biomedical Ethics, Stanford University, reporting on the use of 
hypothetical design exercises in order to examine values in biomedical AI/ML development. 

• Meghan Halley, PhD, MPH, Senior Research Scholar in Biomedical Ethics, Stanford University: 
“Toward more ethical and sustainable data stewardship in rare disease research” reporting on the 
parameters of ethical data sharing and sustainability in rare disease research, involving perspectives on 
cloud-based genomic databases. 

• Krystal Tsosie, PhD, MPH, Assistant Professor, School of Life Sciences, Center for Biology and 
Society, Arizona State University: “Platforms Not Platitudes: Operationalizing Ethics and Advancing 
Indigenous Data and Digital Sovereignty” on community data governance and stewardship with digital 
data tools rooted in machine learning and dynamic consent e-platforms. 

• Carole Federico, PhD, GSK.ai-Stanford Ethics Fellow, Stanford University: “Synthetic Data for 
Biomedicine: Epistemic and Ethical Challenges”. 

Interactive Hypothetical Design Case Study Presentation: 

Nicole Martinez-Martin, JD, PhD, Assistant Professor in Biomedical Ethics, Stanford University 
Mildred Cho, PhD, Professor in Biomedical Ethics, Stanford University 

Tiffany Bright, Co-Director, Center for Artificial Intelligence Research, Cedars-Sinai, Computational Biomedicine 


Speaker Presentations 


The speaker presentations will provide examples of how issues of diversity and inclusion, as well as social 
responsibility, are being engaged in the fields of genomics and machine learning in medicine. 


Genes, and the social narratives we tell about them, continue to grip the popular imagination. In particular, 
claims regarding genetic differences in human behavior and social outcomes have been a pervasive and often 
ugly feature of American society since the eugenics movement of the twentieth century. Today, researchers in 
the rapidly growing field of social/behavioral genomics investigate whether and how genetic differences 
between individuals relate to differences in behaviors (e.g., aggressive behavior) and social outcomes (e.g., 
educational attainment), as well as how genetic information can inform the design of social/behavioral studies. 
There is staunch and polarizing academic debate about the risks and benefits of this science. Many researchers 
are optimistic that this work will increase understanding of human behavior, improve health and well-being, 
and reduce societal inequality. Others worry about its potential to be misused in service of racist, classist, and 
ableist claims. 


Defining the harms and benefits of research has traditionally been left to researchers, professional societies, 
and regulatory bodies. In the US, researchers are regulated by policies such as the Common Rule (45 CFR 46), 
research ethics committees, and Institutional Review Boards (IRBs). These systems of regulation guide the 
ethical conduct of research by ensuring studies have an acceptable risk-benefit profile such that potential harms 
(i.e., risks) are minimized and potential benefits enhanced. 


Confining debate about the threats and promises of social and behavioral genomics to the research community 
is limiting. Academic considerations of the harms and benefits of research generally neglect to consider the 
broader social impacts. IRBs are expressly prohibited by the Common Rule from considering any broad social 
or policy risks. IRBs generally do not regulate risks other than those directly encountered by research 
participants. However, per the Common Rule, IRBs are allowed to judge the broader social benefits of 
research; that is, whether research has the potential to enhance health or knowledge. As a result, existing 
mechanisms for regulating the ethical conduct of research are limited in their ability to appraise the 
downstream implications of research, especially the potential social harms. 


Daphne Martschenko, PhD (Stanford University) will present the results of an 18-month effort to gather 
input from an 11-member Community Sounding Board comprised of individuals from across the United States 
on the risks, benefits, and ethical responsibilities of social and behavioral genomics. Attendees will leave this 
presentation with tools that can help them better elicit and engage public perspectives to produce socially and 
ethically informed decisions about whether and how to conduct biomedical research, as well as socially and 
ethically responsible policy decisions and research communication. 


The presentation by Krystal Tsosie, PhD, MPH (Arizona State University) will provide an overview of how 
community data governance and stewardship with digital data tools rooted in machine learning and dynamic 
consent e-platforms have been applied to advance Indigenous Data and Digital Sovereignty. 


Mildred K. Cho, PhD (Stanford University) has conducted research regarding the integration of ethical values 
into medical AI/ML. Her most recent work examines the use of hypothetical design exercises in order to 
support ethics in the development of AI/ML applications in medicine. Carole Federico, PhD (GSK.ai-Stanford 
Ethics Fellow) will discuss ethical issues relevant to synthetic data, with a focus on representativeness 
and fairness in synthetic data and practical challenges in applying existing ethical frameworks to synthetic data. 


Machine learning predictive analytics (MLPA) are increasingly utilized in health care to reduce costs and 
improve efficacy. The growth of MLPA could be fueled by payment reforms that hold health care 
organizations responsible for providing high-quality, cost-effective care. At the same time, policy analysts, 
ethicists, and computer scientists have identified unique ethical and regulatory challenges from the use of 
MLPA in health care, and they have also proposed a variety of principles and guidelines focused on 
confronting these challenges. 


However, critical gaps in knowledge have challenged our ability to assess these potential solutions. 
Understanding the perspectives of MLPA developers is essential for overcoming the “principles-to-practice” 
gap. Meghan Halley, MPH, PhD (Stanford University) will present a study that sought to better characterize 
available MLPA health care products, identifying and characterizing claims about products recently or 
currently in use in US health care settings that are marketed as tools to improve health care efficiency by 
improving quality of care while reducing costs. The research team conducted systematic database searches of 
relevant business news and academic research to identify MLPA products for health care efficiency meeting 
our inclusion and exclusion criteria. Their findings provide a foundational reference to inform the analysis of 
specific ethical and regulatory challenges arising from the use of MLPA to improve healthcare efficiency. 


Mildred Cho, PhD (Stanford University) has conducted research examining how developers of machine 
learning applications in healthcare envision and put values into practice in their work. Using a case study 
approach that draws from issues from the workshop presentations, Dr. Cho, Nicole Martinez-Martin, JD, 
PhD (Stanford University) and Tiffany Bright, PhD (Center for Artificial Intelligence Research Cedars-Sinai) 
will lead the audience in an interactive discussion regarding how values of diversity, representation and social 
responsibility are put into practice in the work of researchers in genomics and computational biomedicine. 
Risk prediction: Methods, Challenges, and Opportunities 


Ruowang Li 


Department of Computational Biomedicine, Cedars-Sinai Medical Center, 
West Hollywood, California, USA 
Email: ruowang.li@cshs.org 


Rui Duan 


Department of Biostatistics, Harvard T.H. Chan School of Public Health, 
Boston, Massachusetts, USA 
Email: rduan@hsph.harvard.edu 


Lifang He 


Department of Computer Science and Engineering, Lehigh University, 
Bethlehem, Pennsylvania, USA 
Email: lih319@lehigh.edu 


Jason H. Moore 


Department of Computational Biomedicine, Cedars-Sinai Medical Center, 
West Hollywood, California, USA 
Email: jason.moore@csmc.edu 


1. Introduction to the workshop 


The objective of this workshop is to delve into the current and future landscape of risk prediction 
within the realm of disease and epidemiological research. Discussion topics encompass everything 
from data sources to model implementation. The workshop will feature speakers addressing 
commonly used data sources—genetics, imaging, clinical, and epidemiological data—in developing 
prediction models. Moreover, the workshop will cover model-based and post-hoc analyses, delving 
into biases, uncertainty quantification, model interpretation, fairness, diversity of prediction results, 
and the transferability and generalizability of models across different populations and datasets. The 
moderated discussion session will offer a future perspective on the validation and implementation 
of risk prediction models. The workshop will maintain a balanced focus across all stages of risk 
prediction model development and validation. By emphasizing a well-rounded workshop theme 
instead of exclusively delving into methodologies, we aim to create an environment that fosters the 
exchange of ideas and viewpoints among speakers and audiences. 


2. Workshop Presenters 


The three-hour workshop will have a total of six presentations followed by a moderated panel 
discussion session. The workshop speakers are: 


Randi Foraker, PhD, is the Director of the Center for Population Health Informatics (CPHI) at the 
Institute for Informatics (I2) and a Professor of Medicine within the Division of General Medical 
Sciences at Washington University in St. Louis. As director of the CPHI, she aims to improve the 
health of the community through data and support data access, analytics, and dissemination efforts. 
Her own work specializes in the design of population-based studies and the integration of electronic 
health record data with socioeconomic indicators, and her research portfolio has been supported by 
a combination of governmental and industry grants and contracts. Her most recent research has 
focused on the application of clinical decision support to complement risk scoring in primary care, 
cardiology, and oncology. Dr. Foraker also serves as Director of the Public Health Data and Training 
Center for the Institute for Public Health. As director of the Data and Training Center, she aims to 
amplify public health knowledge through data sharing, strategic partnerships with the community, 
and the training of future public health leaders. During the COVID pandemic, she has served as PI 
of the COVID umbrella IRB leveraging electronic health record data at Washington University in 
St. Louis and works closely with investigators who conduct research using data from our COVID 
Data Commons, which is maintained by I2. Dr. Foraker chairs the Epidemiology Strike Force and 
convenes members of the St. Louis City, St. Louis County, Jefferson County, Franklin County, and 
St. Charles County Departments of Public Health on a weekly basis along with academic, health 
system, and business partners to assist with their data architecture, management, and analytic needs 
during the pandemic and beyond. 


Yong Chen, PhD, is Professor of Biostatistics at the University of Pennsylvania. He directs the 
Computing, Inference and Learning Lab at the University of Pennsylvania 
(https://penncil.med.upenn.edu/about-pi/), which focuses on integrating fundamental principles and 
wisdom of statistics into quantitative methods for tackling key challenges in modern biomedical 
data. Dr. Chen is an expert in synthesis of evidence from multiple data sources, including systematic 
review and meta-analysis, distributed algorithms, and data integration, with applications to 
comparative effectiveness studies, health policy, and precision medicine. He is also working on 
developing methods to deal with suboptimal data quality issues in health system data, dynamic risk 
prediction, pharmacovigilance, and personalized health management. He has over 100 publications 
in a wide spectrum of methodological and clinical areas. Dr. Chen has been principal investigator 
on a number of grants, including R01s from the National Library of Medicine and the National 
Institute of Allergy and Infectious Diseases, and an Improving Methods for Conducting 
Patient-Centered Outcomes Research grant from the Patient-Centered Outcomes Research Institute. 
Dr. Chen received his bachelor’s degree in Mathematics at the University of Science and Technology 
of China, and his Master’s degree in Pure Mathematics and Ph.D. in Biostatistics at the Johns Hopkins University. He is an 
elected fellow of the Society for Research Synthesis Methodology, and the International Statistical 
Institute. He is a recipient of Best Paper Award by the International Medical Informatics Association 
(IMIA) Yearbook Section on Clinical Research Informatics, Institute of Mathematical Statistics 
Travel Award, Margaret Merrell Award for excellence in research at the Johns Hopkins University, 
and Distinguished Faculty Award at the University of Pennsylvania. 


Graciela Gonzalez-Hernandez, PhD, is Vice Chair for Research and Education in the new 
Department of Computational Biomedicine at Cedars-Sinai Medical Center. Prior to joining Cedars- 
Sinai in May 2022, Dr Gonzalez-Hernandez was an Associate Professor of Informatics in the 
Department of Biostatistics, Epidemiology and Informatics (DBEI) of the Perelman School of 
Medicine, University of Pennsylvania. She transferred her Health Language Processing (HLP) Lab 
to Cedars-Sinai, which focuses on natural language processing (NLP) and machine learning for 
knowledge discovery, extracting unstructured information from clinical records, journal articles, 
and social media postings to elucidate data patterns, trends, and relationships that can aid the 
discovery process in areas such as pharmacoepidemiology, clinical research, or public health 
monitoring and surveillance. Dr Gonzalez-Hernandez and her team have made available to the 
health research community novel approaches to complete pipelines for information extraction from 
different sources using NLP, such as the DeepADRMiner pipeline for extracting and normalizing 
adverse effects from social media — a unique end-to-end system that makes it possible to tap into 
the value of direct reports by patients. She has published over 220 peer-reviewed articles in 
prestigious journals and conferences, routinely making code and datasets available to other 
researchers, and ensuring reproducibility. These publications span multiple areas of Biomedical 
Informatics, including natural language processing, bioinformatics, biomedical ontologies, 
information retrieval, MS and machine learning, as well as domain-specific publications in 
collaboration with clinicians and epidemiologists. Her work has appeared in the top peer-reviewed 
journals, including Nature Digital Medicine, JAMA Network Open, Bioinformatics, BMC 
Bioinformatics, the Journal of the American Medical Informatics Association, and the Journal of 
Biomedical Informatics, among others, as well as in numerous informatics conference proceedings. 


Bogdan Pasaniuc, PhD, is a professor of Computational Medicine, Human Genetics, and 
Pathology & Laboratory Medicine at UCLA. Dr Pasaniuc develops statistical and computational 
methods to understand the genetic basis of disease, focusing on under-represented populations, 
integrative genomics, and biobank studies. Dr Pasaniuc’s group developed machine learning methods 
to integrate epigenetic profiles within trans-ancestry studies to localize disease variants and genes; 
his group introduced transcriptome-wide association studies (TWAS) using predicted gene 
expression as a principled approach to identify disease genes for many traits such as schizophrenia, 
ovarian cancer, and prostate cancer. Dr Pasaniuc serves as Associate Director of Population 
Genetics of the Institute for Precision Health at UCLA, which links the genetics of more than 150k 
patients with their electronic health records to predict health outcomes, to stratify patients based on 
their genetic risk of disease, and to translate genomics to the clinic. Dr Pasaniuc also serves as PI for 
the Center for Admixed populations and Health Equity and for the Biomedical Data Science 
Training Program for Precision Health Equity at UCLA. 


John Witte, PhD, is serving as Vice Chair and professor in the Department of Epidemiology & 
Population Health, and as a professor of Biomedical Data Science and, by courtesy, of Genetics. He 
will also serve as a member of the Stanford Cancer Institute. Dr. Witte is an internationally 
recognized expert in genetic epidemiology. His scholarly contributions include deciphering the 
genetic and environmental basis of prostate cancer and developing widely used methods for the 
genetic epidemiologic study of disease. His prostate cancer work has used comprehensive 
genome-wide studies of germline genetics, transcriptomics, and somatic genomics to successfully detect 
novel variants underlying the risk and aggressiveness of this common disease. A key aspect of this 
work has been distinguishing genetic factors that may drive increased prostate cancer risk and 
mortality among African American men. Providing an avenue to determine which men are more 
likely to be diagnosed with clinically relevant prostate cancer and require additional screening or 
specific treatment can help reduce disparities in disease prevalence and outcomes across 
populations. Dr. Witte has also developed novel hierarchical and polygenic risk score modeling for 
undertaking genetic epidemiology studies. These advances significantly improve our ability to 
detect disease-causing genes and to translate genetic epidemiologic findings into medical practice. 
Dr. Witte has received the Leadership Award from the International Genetic Epidemiology Society 
(highest award), and the Stephen B. Hulley Award for Excellence in Teaching. His extensive 
teaching portfolio includes a series of courses in genetic and molecular epidemiology. He has 
mentored over 50 graduate students and postdoctoral fellows, serves on the executive committees 
of multiple graduate programs, and has directed a National Institutes of Health funded post-doctoral 
training program in genetic epidemiology for over 20 years. Recently appointed to the National 
Cancer Institute Board of Scientific Counselors, Dr. Witte has been continuously supported by the 
National Institutes of Health. 


Marinka Zitnik, PhD, is an Assistant Professor at Harvard University in the Department of 
Biomedical Informatics. Dr. Zitnik is Associate Faculty at the Kempner Institute for the Study of 
Natural and Artificial Intelligence, Broad Institute of MIT and Harvard, and Harvard Data Science. 
Dr. Zitnik investigates foundations of AI to enhance scientific discovery and facilitate individualized 
diagnosis and treatment in medicine. Her algorithms and methods have had a tangible impact, which 
has garnered the interest of government, academic, and industry researchers and has put new tools in 
the hands of practitioners. Some of her methods are used by major biomedical institutions, including 
Baylor College of Medicine, Karolinska Institute, Stanford Medical School, and Massachusetts 
General Hospital. 


Statistical analysis of single-cell protein data 


Brooke L. Fridley, PhD 
Department of Biostatistics and Bioinformatics, Moffitt Cancer Center 
Tampa, FL 33612, USA 
Biostatistics and Epidemiology Core, Children’s Mercy Hospital 
Kansas City, MO 64108, USA 


Email: Brooke.Fridley@Moffitt.org; Fridley.Brooke@gmail.com 


Simon Vandekar, PhD 
Department of Biostatistics, Vanderbilt University Medical Center 
Nashville, TN 37203, USA 
Email: Simon.Vandekar@VUMC.org 


Inna Chervoneva, PhD 
Division of Biostatistics, Thomas Jefferson University 
Philadelphia, PA 19107, USA 
Email: Inna.Chervoneva@Jefferson.edu 


Julia Wrobel, PhD 
Department of Biostatistics and Bioinformatics, Emory University 
Atlanta, GA 30322, USA 
Email: Julia.Wrobel@Emory.edu 


Siyuan Ma, PhD 
Department of Biostatistics, Vanderbilt University Medical Center 
Nashville, TN 37203, USA 
Email: Siyuan.Ma@VUMC.org 


Immune modulation is considered a hallmark of cancer initiation and progression, with immune 
cell density being consistently associated with clinical outcomes of individuals with cancer. 
Multiplex immunofluorescence (mIF) microscopy combined with automated image analysis is a 
novel and increasingly used technique that allows for the assessment and visualization of the tumor 
microenvironment (TME). Recently, application of this new technology to tissue microarrays 
(TMAs) or whole tissue sections from large cancer studies has been used to characterize different 
cell populations in the TME with enhanced reproducibility and accuracy. Generally, mIF data has 
been used to examine the presence and abundance of immune cells in the tumor and stroma 
compartments; however, this aggregate measure assumes uniform patterns of immune cells 
throughout the TME and overlooks spatial heterogeneity. Recently, the spatial contexture of the 
TME has been explored with a variety of statistical methods. In this PSB workshop, speakers will 
present some of the state-of-the-art statistical methods for assessing the tumor immune 
microenvironment (TIME) from mIF data. 


Keywords: spatial biology, multiplex immunofluorescence, single-cell protein, tumor 
microenvironment, biostatistical analysis, spatial analysis 


© 2023 The Authors. Open Access chapter published by World Scientific Publishing Company and 
distributed under the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC) 4.0 
License. 


1. Introduction, Background and Motivation 


The treatment of cancers has been revolutionized in recent years with the advent of 
immunotherapies. However, not all patients respond to immunotherapies, and a subset of patients 
who initially respond to immunotherapy go on to develop resistance. To understand why some 
patients do not respond to immunotherapies, much research has been devoted to understanding the 
role of the immune contexture of the tumor immune microenvironment (TIME) and its association 
with clinical outcomes. Thus, immune profiling using a variety of approaches has become an 
important part of immuno-oncology. 

Some commonly used approaches for studying the tumor immune microenvironment include 
(but are not limited to): flow cytometry, imaging mass cytometry, immunohistochemistry 
(IHC), immune cell deconvolution of bulk RNA-seq data, single-cell RNA-seq, spatial 
transcriptomics, and multiplex immunofluorescence (mIF). Multiplex immunofluorescence 
microscopy combined with automated image analysis is a novel and increasingly used technique 
that allows for the assessment and visualization of the TME. This technology has been applied to 
a variety of sample types, from whole slide images to regions of interest (ROIs) and tissue 
microarrays (TMAs). 

As with any new technology, there are inevitably challenges with the statistical analysis of 
single-cell imaging data. Some of the challenges come from cell phenotyping, which is 
labeling cells as positive or negative for each antibody of interest. This necessary preprocessing 
step occurs before spatial data analysis and is critical for accurately estimating immune cell 
abundance in the TIME. After phenotyping, it is typical to measure immune cell abundance, 
usually calculated as the percentage or proportion of specific cell types in the tumor compartment 
of the tissue. A challenge of this task is that many cell types are often observed at low abundance 
(i.e., the data are zero-inflated), particularly in low immune infiltrated tumors (e.g., immune “cold” tumors). 
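
As a minimal sketch of the abundance calculation described above, the following Python code computes the percentage of one cell type per sample and the fraction of samples in which that cell type is absent. The sample identifiers, phenotype labels, and function names are hypothetical and chosen only for illustration; the workshop materials themselves are in R.

```python
from collections import Counter


def cell_type_proportions(cell_types):
    """Return {cell_type: proportion} for the phenotyped cells of one sample."""
    counts = Counter(cell_types)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cell_type: n / total for cell_type, n in counts.items()}


def zero_fraction(samples, cell_type):
    """Fraction of samples in which `cell_type` is not observed at all."""
    zeros = sum(1 for cells in samples.values() if cell_type not in cells)
    return zeros / len(samples)


# Hypothetical phenotype labels for the tumor compartment of three samples.
samples = {
    "S1": ["tumor"] * 8 + ["CD8+ T cell"] * 2,
    "S2": ["tumor"] * 10,  # an immune "cold" sample with no CD8+ T cells
    "S3": ["tumor"] * 9 + ["CD8+ T cell"],
}

# Percent of CD8+ T cells per sample (0.0 when the cell type is absent).
cd8_percent = {
    sample_id: 100 * cell_type_proportions(cells).get("CD8+ T cell", 0.0)
    for sample_id, cells in samples.items()
}

# Share of samples with a zero count, a simple indicator of zero inflation.
cd8_zero_fraction = zero_fraction(samples, "CD8+ T cell")
```

A large share of exact zeros, as in sample S2 here, is what motivates zero-inflated models rather than models that assume a continuous proportion.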

Besides the protein markers used for phenotyping cells, it is often important to quantify the 
actual levels of proteins of interest in all or some cell types. Such quantitative functional markers 
may include proliferation markers (e.g., Ki-67, PCNA), checkpoint proteins (e.g., PD-1, PD-L1, 
CTLA-4), and growth factors and receptors (e.g., EGFR, HER2). Traditionally, a single mean 
expression level across the cells of interest is computed and considered as a biomarker. This 
approach ignores important tumor heterogeneity and has low sensitivity for detecting high 
expression in some, but not all, cells of interest. Alternative approaches have been recently 
developed, using the entire distributions of single-cell protein expression levels in a tumor 
tissue to derive quantitative functional markers. 

Finally, there is growing evidence that the spatial architecture of the TIME has a high impact on 
disease progression and response to immunotherapy. Generally, mIF data has been used to 
examine the presence and abundance of immune cells in the TIME; however, this aggregate 
measure assumes uniform patterns of immune cells throughout the tumor and overlooks spatial 
heterogeneity. Recently, the spatial contexture of the TIME has been explored with a variety of 
spatial statistical methods, including those for assessing co-localization. In this session, speakers 
will present some of the state-of-the-art statistical methods for assessing the TIME from mIF data. 
All slides and R code presented during the workshop can be found at 
http://juliawrobel.com/PSB_scProteomics. 


2. Speaker Abstracts 


Overview of abundance-based and spatial-based analysis approaches for multiplex imaging 
data 
Brooke L Fridley 


With the advent of immunotherapies for the treatment of cancer, much research is being conducted 
to understand the tumor immune microenvironment (TIME). To date, much of the research 
completed has focused on understanding the abundance of different immune cell subsets in the 
TIME using either single-cell RNA-seq or multiplex immunofluorescence (mIF). One benefit in 
using mIF based technologies is that, in addition to abundance of immune cells, one is also able to 
get the spatial location of these cells within the TIME. Thus, researchers can answer questions that 
relate to the spatial architecture or contexture of the TIME and how this might impact clinical 
outcomes. In this presentation, we provide an overview of how mIF data is generated and of the 
analysis methods used for assessing the non-spatial aspects of the TIME (i.e., abundance level analyses). 
After providing an overview of mIF data and abundance-based analysis approaches, we will 
review a variety of spatial statistical approaches for analyzing the spatial contexture. To facilitate 
spatial analyses, we will also present an R package, spatialTIME, developed to generate these 
spatial statistics on large sets of samples. 


Normalization and Cell Phenotyping for mIF data 
Simon Vandekar 


Normalization and cell phenotyping are critical steps in the multiplexed image analysis pipeline 
prior to performing downstream statistical analysis because they remove batch effects and identify 
consistent cell types across slides. These analysis steps are particularly challenging for mIF data 
due to the unique heterogeneity of the image intensities across slides and overlapping cell 
distributions. We review some recently proposed normalization methods and discuss the three 
main procedures for cell phenotyping (marker gating, unsupervised clustering, and supervised 
algorithms) in the context of mIF imaging, including our recently developed semi-supervised 
algorithm, GammaGateR. The R package GammaGateR focuses on efficiently estimating the 
marginal distributions of single-cell marker intensities using a novel closed-form Gamma mixture 
model to identify marker positive cells. It incorporates biological constraints to improve 
consistency across a large number of slides and allows users to interactively curate the model fit. 
We compare several cell phenotyping algorithms developed for multiplexed imaging and 
demonstrate how to use the results to perform spatial analyses of mIF imaging data. 


Quantile biomarkers based on single-cell multiplex immunofluorescence imaging data 
Inna Chervoneva 


Modern pathology platforms for multiplex fluorescence-based immunohistochemistry provide 
distributions of cellular signal intensity (CSI) levels of proteins across entire cell populations 
within the sampled tumor tissue. However, heterogeneity of CSI levels is usually ignored, and the 
simple mean signal intensity (MSI) value is considered as a cancer biomarker. To account for 
tumor heterogeneity, we consider the entire CSI distribution as a predictor of clinical outcome. 
This allows retaining all quantitative information at the single-cell level by considering the values 
of the quantile function (inverse of the cumulative distribution function) estimated from a sample 
of CSI levels in a tumor tissue. 

A simple and intuitive approach is to select an optimal quantile of the CSI distribution as the 
best predictor of clinical outcome of interest. In Yi et al. (2023), we developed an algorithm, 
implemented in the R package Qindex, for selecting optimal CSI distribution quantiles as best 
predictors of outcome. The proposed algorithm was used to select optimal quantile biomarkers of 
progression-free survival in a large cohort of breast cancer patients and validated in an independent 
external validation cohort. The optimal quantile protein biomarkers yielded generally improved 
prognostic value as compared to the standard MSI biomarkers. 
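
To illustrate why a quantile of the CSI distribution can carry information that the mean does not, the sketch below estimates the quantile function from a sample of CSI levels using linear interpolation. It is an assumption-laden illustration only: the function name and the CSI values are hypothetical, and this is not the quantile selection algorithm implemented in Qindex.

```python
def empirical_quantile(values, p):
    """Estimate the quantile function at probability p by linear interpolation."""
    if not values:
        raise ValueError("need at least one value")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    xs = sorted(values)
    position = p * (len(xs) - 1)          # fractional index into the sorted sample
    lower = int(position)
    upper = min(lower + 1, len(xs) - 1)
    fraction = position - lower
    return xs[lower] + fraction * (xs[upper] - xs[lower])


# Hypothetical CSI levels for two tumors with the same mean signal intensity.
csi_tumor_a = [1.0] * 9 + [11.0]  # small subpopulation of high-expressing cells
csi_tumor_b = [2.0] * 10          # homogeneous expression

msi_a = sum(csi_tumor_a) / len(csi_tumor_a)
msi_b = sum(csi_tumor_b) / len(csi_tumor_b)

# The upper quantile separates the two tumors even though the means agree.
q95_a = empirical_quantile(csi_tumor_a, 0.95)
q95_b = empirical_quantile(csi_tumor_b, 0.95)
```

In this toy example the two tumors have identical MSI, while the 0.95 quantile is much higher for the heterogeneous tumor, which is the kind of signal a quantile biomarker is meant to capture.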

A more comprehensive approach is to derive new biomarkers as single-index predictors based 
on the entire CSI distribution summarized as a quantile function. The proposed Quantile Index 
(QI) biomarker is defined as a linear or nonlinear functional regression predictor of outcome. The 
linear functional regression quantile index (FR-QI) is the integral of the subject-specific CSI 
quantile function multiplied by a common weight function. The nonlinear functional regression 
quantile index (nFR-QI) is computed as the integral of an unspecified, twice-differentiable bivariate 
function with the probability p and the subject-specific quantile function as arguments. The weight 
function and the nonlinear bivariate function are represented by penalized splines and estimated by 
fitting suitable functional regression models to a clinical outcome. The proposed QI biomarkers were 
derived for proteins expressed in cancer cells of malignant breast tumors and compared to the 
standard MSI predictors and optimal quantile protein biomarkers. The R package Qindex implements 
the optimization of QI biomarkers and their evaluation in an independent test set. 
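
The two definitions above can be written compactly as follows. The notation is an assumption introduced here for illustration and does not come from the abstract: Q_i(p) denotes the CSI quantile function of subject i, \beta(p) the common weight function, and F(p, q) the unspecified bivariate function.

```latex
\mathrm{FR\text{-}QI}_i = \int_0^1 \beta(p)\, Q_i(p)\, dp,
\qquad
\mathrm{nFR\text{-}QI}_i = \int_0^1 F\bigl(p,\, Q_i(p)\bigr)\, dp .
```

Under this notation, the linear index is the special case of the nonlinear index with F(p, q) = \beta(p)\, q.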


Tools and software for functional data analysis of multiplexed imaging data 
Julia Wrobel 


The TME, which characterizes the tumor and its surroundings, plays a critical role in 
understanding cancer development and progression. Recent advances in imaging techniques 
enable researchers to study spatial structure of the TME at a single-cell level. Many popular 
approaches for analyzing spatial relationships between cell types or quantifying spatial 
co-expression of biological markers in multiplex imaging data are based on point process theory. The 
locations of cells in mIF data are treated as following a point process; realizations of a point process 
are called “point patterns”, and point process models seek to understand correlations in the spatial 
distributions of cells. Under the assumption that the rate at which cells occur is constant over an 
entire region of interest, a point pattern will exhibit complete spatial randomness (CSR), and it is 
often of interest to model whether cells deviate from CSR through either clustering or repulsion. 
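
One widely used spatial summary function for assessing departures from CSR is Ripley's K. The sketch below is a naive estimate without edge correction, compared against the theoretical value under CSR in the plane; the choice of Ripley's K, the function names, and the coordinates are assumptions made here for illustration, not the specific methods presented in the talk.

```python
import math


def ripley_k(points, r, area):
    """Naive Ripley's K estimate at radius r (no edge correction)."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    pairs = 0
    for i, (xi, yi) in enumerate(points):
        for j, (xj, yj) in enumerate(points):
            # Count ordered pairs of distinct points within distance r.
            if i != j and math.hypot(xi - xj, yi - yj) <= r:
                pairs += 1
    return area * pairs / (n * (n - 1))


def csr_k(r):
    """Theoretical K under complete spatial randomness in the plane."""
    return math.pi * r ** 2


# Hypothetical cell centroids: a tight cluster inside a 10 x 10 window.
clustered = [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
k_hat = ripley_k(clustered, r=0.5, area=100.0)
k_csr = csr_k(0.5)
```

An estimate well above the CSR value at a given radius suggests clustering at that scale, and an estimate well below it suggests repulsion. In practice, edge corrections and evaluation across a range of radii are needed.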

Spatial summary functions characterize the degree of spatial interaction among cells across 
different radii; however, these are often evaluated at a single arbitrarily chosen cellular distance. 
Using techniques from functional data analysis, we introduce an approach to model the association 
between these summary spatial functions and patient-level survival outcomes across all radii 
simultaneously, while controlling for other clinical scalar predictors such as age and disease stage. 
In addition, we introduce a novel hypothesis test to determine what level of model flexibility is most 
appropriate for a given multiplex imaging dataset. Finally, our methods are implemented in mxfda, 
a general-purpose R package for functional data analysis of multiplex imaging data. 


A Flexible Generalized Linear Mixed Effects Model for Testing Cell-Cell Colocalization in 
Spatial Immunofluorescent Data 
Siyuan Ma 


mIF data analysis often aims to characterize the nuanced spatial context of tissue 
microenvironments, such as the infiltration or exclusion of certain immune cell populations in 
tumor tissues. To test for cell colocalization or exclusion events, existing methods often rely on 
image-wide statistics to create null distributions for cell colocalization events and evaluate their 
statistical significance. Given that tissue characteristics can be image-specific (e.g., size of 
images, the local topology of tissue organization), this type of approach does not generalize well 
for comparisons between images/conditions. We show that, by examining cell colocalization 
events on a per-cell basis, they can be modeled with common count-based distributions such as 
the binomial. As such, cell colocalization or exclusion can be practically analyzed with generalized 
linear mixed effects models with spatially correlated error terms. This allows flexible inclusion 
and testing of image/condition effects and subject-specific correlations, because they can be easily 
modeled as fixed or random regression effects. We demonstrate that this model relies on essentially 
the same assumptions as existing image-wide modeling approaches. In practice, it can be 
implemented with the readily available R package spaMM. We exemplify the utility of such a 
model with an application in protein immunofluorescent imaging of inflammatory bowel disease 
tissues. 
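
The per-cell view described above can be sketched as follows: for each index cell, count how many neighbors fall within a radius and how many of those belong to the target cell type, giving a (successes, trials) pair suitable for a binomial response. This is one possible reading under stated assumptions; the function name, radius, and cell table are hypothetical, and the mixed model with spatially correlated errors (fitted in R with spaMM in the talk) is not shown.

```python
import math


def per_cell_neighbor_counts(cells, index_type, target_type, radius):
    """For each index cell, return (target-type neighbors, all neighbors) within radius."""
    results = []
    for i, (xi, yi, type_i) in enumerate(cells):
        if type_i != index_type:
            continue
        successes = 0
        trials = 0
        for j, (xj, yj, type_j) in enumerate(cells):
            if i == j or math.hypot(xi - xj, yi - yj) > radius:
                continue
            trials += 1                      # any neighboring cell is one trial
            if type_j == target_type:
                successes += 1               # neighbor of the target type
        results.append((successes, trials))
    return results


# Hypothetical cell table: (x, y, phenotype).
cells = [
    (0.0, 0.0, "tumor"),
    (1.0, 0.0, "T cell"),
    (0.0, 1.0, "tumor"),
    (5.0, 5.0, "tumor"),
    (5.0, 6.0, "stroma"),
]

counts = per_cell_neighbor_counts(cells, "tumor", "T cell", radius=1.5)
```

Each pair could then enter a binomial regression as one observation, with image or condition as a fixed effect and subject as a random effect.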


Cells consist of large components, such as organelles, that recursively factor into smaller 
systems, such as condensates and protein complexes, forming a dynamic multi-scale structure 
of the cell. Recent technological innovations have paved the way for systematic interrogation 
of subcellular structures, yielding unprecedented insights into their roles and interactions. 
In this workshop, we discuss progress, challenges, and collaboration to marshal various 
computational approaches toward assembling an integrated structural map of the human 
cell. 


Keywords: cell mapping, subcellular structures, computational modeling of cell 


1. Overview 


A fundamental objective of cell biology is to decode the intricate multi-scale structures within 
cells, ranging from macroscopic organelles to microscopic condensates and protein complexes. 
This goal necessitates a comprehensive understanding of the spatial and functional organization 
of subcellular components, particularly within the context of cell function and diseases. 
In recent years, a plethora of advanced technologies have emerged, enabling systematic 
interrogation of subcellular structures and providing unprecedented insights into their functional 
significance. For example, immunofluorescence imaging facilitates the real-time visualization 
of static and dynamic subcellular interactions at high resolution. Similarly, cryo-electron 
tomography and microscopy capture intricate structural details of subcellular components 
in their native, hydrated state, thus preserving their functional context. On the biochemical 
front, affinity purification, co-elution, and crosslinking mass spectrometry techniques 
have provided avenues for elucidating the complex networks of protein interactions within cells. 
The emerging machine learning pipelines associated with these technologies have further 
augmented the systematic interpretation of cell architecture and its association with diseases. 


© 2023 The Authors. Open Access chapter published by World Scientific Publishing Company 
and distributed under the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC) 
4.0 License. 

The integration of these complementary technologies represents a promising avenue for 
mapping the architecture of cells across a broad range of scales. Using two of these techniques, 
protein imaging and affinity purification, the session organizers have recently published a novel 
framework, called MuSIC (Multi-Scale Integrated Cell), for assembling hierarchical maps 
of human subcellular components spanning the multiple scales of cell biology. The timely 
and distinct opportunity that emerges from this work is to assemble a key group of thought 
leaders in a suitable location to discuss progress, open challenges, and, most importantly, 
how collaborative teams can be established to marshal the various technologies toward an 
integrated structural map of the human cell. 

Hence, this workshop, “Tools for assembling the cell: Towards the era of cell structural 
bioinformatics,” aims to be a catalyst for scientific discourse and collaboration, providing a 
platform for eminent professionals from varying domains to explore and strategize the 
future of subcellular structure mapping. We invited seven distinguished speakers (Drs. David 
Baker, Markus Covert, Jan Ellenberg, Rachel Karchin, Tychele Turner, Aubrey Weigel, and 
Marinka Zitnik) to share insights on data acquisition and computational approaches for 
cellular modeling. This workshop is designed to provide attendees with a deep dive into the present 
technological innovations and highlight avenues for potential collaboration and exploration. 


2. Navigating the Workshop 


Developing a spatiotemporal map of the cell necessitates integrating various sources of data 
into a single model. To enable communication and synergy between experimental scientists and 
computational modelers, this workshop features incisive talks from seven experts in procuring 
spatiotemporal biological data and advancing computational modeling of cellular architecture 
across multiple scales. 

Dr. David Baker is the Henrietta and Aubrey Davis Endowed Professor of Biochemistry 
at the University of Washington, Director of the Institute for Protein Design, and an Investigator 
at the Howard Hughes Medical Institute. His research focuses on developing protein design software 
and using it to create molecules that solve challenges in medicine, technology, and 
sustainability. His group developed the Rosetta algorithm for ab initio protein structure prediction. 
Most recently, his group has developed RoseTTAFold, a three-track network that processes 
sequence, distance, and coordinate information simultaneously and achieves more accurate 
protein structure prediction. 

Dr. Markus Covert, a Professor of Bioengineering and, by courtesy, of Chemical and 
Systems Biology at Stanford University, focuses on building computational models of complex 
biological processes and using these models to guide an experimental program. His lab 
pioneered the “whole-cell” model encoding all known information about each gene and molecule 
to predict cell behaviors. His lab has also made significant contributions to live-cell imaging 
of immune signaling, including a game-changing method to analyze microscopy images using 
deep learning and a technique that traces cellular behavior from the initial stimulus, through 
the signaling pathways, down to genome-wide changes in gene expression, within the single 
cell. 


Dr. Jan Ellenberg is Head of Cell Biology and Biophysics, and Head of the European 
Molecular Biology Laboratory (EMBL) Imaging Center, at EMBL Heidelberg. He has 
developed state-of-the-art quantitative fluorescence-based imaging techniques and combined these 
technologies with subsequent automation and analysis platforms. His lab leveraged these 
four-dimensional imaging approaches to enable characterization of processes within human cells, 
such as protein localization during cell division and nuclear pore complex assembly. 

Dr. Rachel Karchin is a professor at Johns Hopkins University and the Institute for 
Computational Medicine. She has made significant contributions to the field of cancer genomics 
by leveraging 3D protein structure for variant interpretation and developing tools to detect somatic 
mutation hotspot regions in 3D protein structures. Similarly, Dr. Tychele Turner, an 
Assistant Professor at Washington University in St. Louis, worked on precision genomics in 
neurodevelopmental disorders, determining all possible relevant variations within an individual 
to the precise nucleotide. Together, they focused on mapping mutations in 3D and 
aimed to compare the 3D mutation clusters between neurodevelopmental diseases and cancers, 
bringing new insight into genomics research. 

Dr. Aubrey Weigel is a Project Scientist of the Cellular Organelle Segmentation in 
Electron Microscopy (COSEM) Project Team at Howard Hughes Medical Institute (HHMI) - 
Janelia Research Campus. She has pioneered a pipeline that combines focused ion beam 
scanning electron microscopy (FIB-SEM) with deep learning annotation methods to reconstruct 
maps of entire cells at 4-8 nm resolution. Such data and models are available to the scientific 
community through an open-source platform called OpenOrganelle. These data acquisition 
and analysis techniques can provide insight into complicated cellular processes, and similar 
analyses revealed the dynamics of endoplasmic reticulum (ER)-to-Golgi protein delivery. 

Dr. Marinka Zitnik is an Assistant Professor at Harvard Medical School and is 
affiliated with several Harvard-based institutes. She investigates the foundations of AI to 
enhance scientific discovery and to realize individualized diagnosis and treatment. She proposed 
Decagon, a graph convolutional network model of polypharmacy side effects. She 
also founded Therapeutics Data Commons (TDC), an initiative to access and evaluate AI 
capability across therapeutic modalities and stages of discovery. Its aim is to establish which 
AI methods are most suitable for advancing therapeutic science and why these techniques are 
advantageous. 


3. Discussion and Implications 


In this workshop, we delve into cutting-edge technologies designed to illuminate the spatial 
and functional organizations of subcellular components. Drs. David Baker, Markus Covert, Jan 
Ellenberg, Rachel Karchin, Tychele Turner, Aubrey Weigel, and Marinka Zitnik are the distin- 
guished speakers contributing their extensive knowledge to this workshop. They elucidate the 
advancements in data acquisition, sophisticated analysis techniques, and computational tools 
essential for the assembly of human subcellular components at various scales. This workshop 
serves not only as a repository of knowledge but also as a forum for academic exchange. 
Scientists are welcome to discuss the promises, pitfalls, and challenges of modeling 
subcellular structures. In addition, the insights of the distinguished speakers can foster 
interdisciplinary projects using cell mapping techniques, encouraging potential 
collaborations to drive cell structural biology further. 

ERRATUM 


How Fitbit data are being made available to registered researchers in All of Us 
Research Program 


Hiral Master, Aymone Kouame, Kayla Marginean, Melissa Basford, Paul Harris 
Vanderbilt University Medical Center Nashville TN, USA 
Email: hiral.master@vumc.org, aymone.kouame@vumc.org, kayla.marginean@vumc.org, 
melissa.basford@vumc.org, paul.a.harris@vumc.org 


Michelle Holko 
Google Public Sector Washington DC, USA 
Email: michelle.holko@gmail.com 


In the above PSB article, published in Biocomputing 2023: Proceedings of the Pacific Symposium, pp. 
19-30 (PMCID: PMC9811842; PMID: 36540961), the following correction has been made. 


©2023 The Authors. Open Access chapter published by World Scientific Publishing Company 
and distributed under the terms of the Creative Commons Attribution Non-Commercial 
(CC BY-NC) 4.0 License. 

Author Correction 


In the originally published version of this PSB 2023 full-length conference paper, the authors 
mistakenly showed counts in the figures such that counts <20 for some categories could be derived 
mathematically. The authors apologize for this unintentional error. The authors have therefore 
updated Figures 4a, 4b and 4d so that exact participant counts for categories <20 cannot be derived 
mathematically. This update was made to protect participants’ privacy and to comply with the 
All of Us Data and Statistics Dissemination Policy. 


[Figure 4a. Ethnicity. Categories: Skip or Not Specified (4.63%); None Of These (0.66%); Hispanic Or Latino (6.46%); Not Hispanic Or Latino.] 

[Figure 4b. Race. Categories: Not Specified; Skip; None Of These; More Than One Population; Native Hawaiian Or Other Pacific Islander; Middle Eastern Or North African; Asian; Black; White.] 

[Figure 4d. Age at CDR, by age group: 80-90+, 234 (1.82%); 70-79 (13.92%); 60-69; 50-59; 40-49, 2,013 (15.69%); 30-39; 18-29, 1,079 (8.41%).] 


Fig. 4. Self-reported a) ethnicity, b) race, and d) age of participants with Fitbit data in the June 2022 
curated data repository, which can be accessed by registered users on the Researcher Workbench. 

Figures have been updated so that exact participant counts for categories <20 cannot be derived 
mathematically, as required by the All of Us Data and Statistics Dissemination Policy. 